Abstract


This work addresses two related questions. The first question is what joint
time-frequency energy representations are most appropriate for auditory signals, in
particular, for speech signals in sonorant regions. The quadratic transforms of the
signal are examined, a large class that includes, for example, the spectrograms and
the Wigner distribution. Quasi-stationarity is not assumed, since this would neglect
dynamic regions. A set of desired properties is proposed for the representation: (1)
shift-invariance, (2) positivity, (3) superposition, (4) locality, and (5) smoothness.
Several relations among these properties are proved: shift-invariance and positivity
imply the transform is a superposition of spectrograms; positivity and superposition
are equivalent conditions when the transform is real; positivity limits the
simultaneous time and frequency resolution (locality) possible for the transform, defining
an uncertainty relation for joint time-frequency energy representations; and locality
and smoothness trade off by the 2-D generalization of the classical uncertainty
relation. The transform that best meets these criteria is derived, which consists
of two-dimensionally smoothed Wigner distributions with (possibly oriented) 2-D
gaussian kernels. These transforms are then related to time-frequency filtering, a
method for estimating the time-varying 'transfer function' of the vocal tract, which
is somewhat analogous to cepstral filtering generalized to the time-varying case.
Natural speech examples are provided.


The second question addressed is how to obtain a rich, symbolic description of the
phonetically relevant features in these time-frequency energy surfaces, the so-called
schematic spectrogram. Time-frequency ridges, the 2-D analog of spectral peaks,
are one feature that is proposed. If non-oriented kernels are used for the energy
representation, then the ridge tops can be identified with zero-crossings in the inner
product of the gradient vector and the direction of greatest downward curvature.
If oriented kernels are used, the method can be generalized to give better orientation
selectivity (e.g., at intersecting ridges) at the cost of poorer time-frequency
locality. Many speech examples are given showing the performance for some
traditionally difficult cases: semi-vowels and glides, nasalized vowels, consonant-vowel
transitions, female speech, and imperfect transmission channels.

Chapter 1.
Introduction


In order to perceive speech and other sounds, the incoming sound wave must be
transformed into a variety of representations, each bringing forth different aspects
of the signal, its source, and meaning. Understanding how we perceive and how
machines can be made to perceive auditory signals means, in part, discovering
appropriate representations for the signals and how to compute them. For many
kinds of sounds, little is known in this respect. What auditory features, for example,
will distinguish a knock at the door from a footstep?

For speech signals, more is thought to be known. A phonetician will tell you, for
example, that the /ae/ in had can be distinguished from the /i/ in bead by the
location of characteristic peaks in their respective spectra. He could even train you
to identify a wide variety of phonetic elements by looking at their spectrograms.
Formalizing this knowledge, however, so that a computer can do this well (in a
general setting) has proved hard.

An analogy may explain why. I could train you to distinguish a Mercedes from some
other car easily; I would just describe the hood ornament.† To train a machine
to do this task would be much harder. Not only would I have to describe the hood
ornament, but I would also have to provide all the visual abilities that I take for
granted with a human: finding edges and boundaries, recognizing closed forms,
etc. I believe the failure to correctly provide the corresponding auditory abilities
(finding spectral 'peaks' and temporal discontinuities, recognizing continuous
forms, etc.) is an important reason why the speech recognition problem has been
so difficult.

† I thank Mark Liberman for this example.

This problem is in some ways even harder than visual analysis. In vision, it is clear
that the two-dimensional image is a natural starting point. In audition, a similar 2D
representation is important, with time along one axis and frequency along the other.
But how should this idea be made precise (the well-known uncertainty principle of
fourier analysis is one of the thorny issues involved)? Should we use the conventional
spectrogram, the Wigner distribution, a pseudo-auditory spectrogram, or something
entirely new, and how should this decision be made?

In vision, edges, lines, and so forth are obviously important features
of an image. In audition, it is harder to decide what are the appropriate primitive
elements. Can some symbolic description summarize the relevant features of a
sound's time-frequency representation analogous to how a line drawing summarizes
an image?

These questions about the early steps in auditory processing are the topic of this 
thesis. The emphasis will be on speech signals primarily because the intermediate 
goals to which the initial computations must aim are better understood. I believe, 
nevertheless, that many of the auditory processing issues discussed here are also 
relevant for other kinds of sounds.

The topic as stated is still too broad. Speech and other signals are made up of many
different kinds of components. For instance, speech has fairly smoothly changing
vocalic regions that are quite different from the more discontinuous structure of
consonantal regions. It is unlikely that the same initial representations will be
appropriate for every kind of signal. The emphasis here will be on signals like those
found in the more continuous, sonorant regions of speech.

In the sonorant regions, we find that an apparent feature is local spectral energy
concentrations that vary in center frequency with time. These peaks are due, in part,
to the "resonances" of the vocal tract - the so-called formants. The formant
locations (labelled F1, F2, ... in order of increasing frequency) specify the general vowel
quality, r-coloring and roundness, while the formant transitions between consonants
and vowels play an important role in consonant identification (see e.g. Chiba &
Kajiyama 1941; Fant 1960; Liberman et al. 1954; Ladefoged 1975). A. Liberman, in
fact, claims that "the second formant transition ... is probably the single most
important carrier of linguistic information in the speech signal" [Liberman et al. 1967].
Thus, restricting the discussion to these regions is by no means uninteresting.

The initial speech processing envisioned here has been divided into two steps. The
first step, which produces a joint time-frequency representation of the signal energy,
is explored in Chapter 2 and Chapter 3. The second step, which produces a symbolic
representation that captures the acoustically relevant features present in the joint
time-frequency energy representation, is explored in Chapter 4 (see Figure 1.1).

One of the most difficult problems in deriving the form of such representations is
deciding which properties or axioms to assume at the outset. If strong assumptions
are made about the received signal, then rigorously defined optimal detection can
result. For example, if we assume that the received signal consists solely of a
known signal in additive Gaussian noise, then we could build a matched filter that
performs optimal Bayesian detection [e.g., see Van Trees 1968]. The disadvantage
of such strong assumptions is that they are seldom universally valid for natural
perceptual signals.

Figure 1.1. The initial speech processing is seen as divided into two steps. (a) The
first step represents the signal energy as joint functions of time and frequency. (b)
The second step builds a symbolic representation of the significant features present
in the joint time-frequency energy representations. At this step, which we call the
schematic spectrogram, there is no undue commitment to the acoustic origin
of the features represented; it is a description of the signal, not its sources. (c)
In subsequent processing, these initial descriptions can be used to decompose the
signal into its acoustic sources.

On the other hand, weaker assumptions made about the received signal can be
combined with assumptions about the design of the representation, things like linearity,
continuity, locality, and stability, that can result in a solution [cf. Marr &
Nishihara]. These design criteria are chosen not on the basis of a specific signal model,
but instead as reasonable choices that should be appropriate for a wide range of
natural signals. The disadvantage of this approach is that the justification of the
design decisions is more intuitive and abstract.


In the best of circumstances, the two approaches would result in the same or similar
solutions to a problem. Thus the auditory processing would perform optimally (in
different senses) when both appropriate weak and strong assumptions are made
about the received signal.

Chapter 2 derives those joint time-frequency energy representations that satisfy
a small set of desirable properties; these properties are intentionally kept quite
general. Chapter 3 re-examines this problem in a more specific setting. Given a
(time-varying) model of speech production, what time-frequency representation of
the signal best depicts the 'transfer function' of the vocal tract while suppressing
the excitation? These two approaches, in fact, yield similar solutions.

In the initial part of Chapter 4, a general, heuristic argument is used to produce a
phonetically relevant, symbolic representation of the signal. In a later part, these
solutions are briefly related to a signal detection model.

In Chapter 5, we look at a wide range of examples using these proposed methods.
We examine some traditionally difficult speech cases: glides and semi-vowels,
nasalized vowels, consonant-vowel transitions, female speech, and imperfect
transmission channels.

Note: For the figures in this thesis, time is in seconds, frequency in
Hertz, and energy in decibels, unless otherwise indicated.




Chapter 2.
The Time-Frequency Energy Representation


This chapter explores the design of joint time-frequency energy representations for
speech signals. A set of desirable properties for such representations to satisfy is
proposed, and the relationships among these properties are discussed. This includes
a general treatment of the uncertainty relations that arise. The signal transforms
that best satisfy these properties are then derived and examined.


2.1. The stationary case

We begin with an analysis of the special case of stationary signals. There is a large
literature for this case; Rabiner & Schafer [1978] and Flanagan [1972] provide good
reviews. The discussion of it here is very condensed and confined to topics that are
relevant to the sequel.

A stationary signal is used here to roughly mean a signal whose frequency content
does not vary with time. More precisely, we consider only deterministic signals that
are periodic and random signals that are correlation-stationary. For both kinds
of signals, the power spectrum, the fourier transform of the autocorrelation
function, captures naturally the energy present at each frequency.† Time is removed
from this representation; the power spectrum is a one-dimensional representation
of energy as a function of frequency.

For speech signals there are, of course, no completely stationary signals. We can,
however, deliberately utter vowels so that they are steady-state for as long as we
like. Figure 2.1 shows the spectrum of a long duration, voiced /i/. We find in the
spectrum many of the characteristic features of a steady-state vowel.

Let us examine the spectrum in Figure 2.1. Note the y-axis is logarithmic to
compress the wide dynamic range of the speech. At a fine scale in this spectrum, there
are peaks spaced about every hundred Hertz; these are the harmonics of the pitch.
The somewhat larger scale peaks, of a few hundred Hz bandwidth, are the formant
peaks. The peak at about 300 Hz is F1 and the peak at about 2300 Hz is F2, which
is characteristic of an /i/ vowel for an adult male. Still larger scale shaping of the
spectrum, so-called spectral balance, is due to the formant locations, the nature of
the voicing and the transmission channel.

The spectral structure of a vowel, therefore, is due acoustically to several factors:
(1) the vocal excitation, e.g., voiced; (2) the vocal tract transfer function,
characterized by its resonant frequencies (the formants); and (3) the transmission
characteristics, e.g., room acoustics. Determining these factors from the speech
(i.e., finding the formant frequencies, the pitch, etc.) is an important intermediate
step in speech analysis, since they decompose the signal into components of nearly
independent origin, and are (thus) starting points for the phonetician's description
of the speech signal.


† For a deterministic signal $x(t)$, its autocorrelation function is $\int x(t+\tau)\,x^*(t)\,dt$, and for a stationary
random process $x(t)$, its autocorrelation function is $E[x(t+\tau)\,x^*(t)]$.







Figure 2.1. Short-time log spectrum of a steady-state /i/. The finest scale
structure corresponds to the harmonics of the pitch, spaced about every 100 Hz. At an
intermediate scale are the formant peaks; e.g., F1 at 300 Hz and F2 at 2300 Hz. At
the largest scale is the overall spectral balance.



Figure 2.2. Spectrum in Figure 2.1 smoothed to suppress the excitation. (a)
Log spectrum convolved with gaussian (cepstral smoothing). (b) Power spectrum
convolved with gaussian (and then transformed to a log scale).

A key point in separating these factors in the speech signal is that they operate
at somewhat different scales in its spectrum; the fine scale structure is due mostly
to the excitation, while the intermediate scale structure is due to the vocal tract
transfer function. A common technique for selecting a scale of interest is to smooth
the spectrum by linear convolution or, equivalently, to window the fourier transform
of the spectrum. The fourier transform of the log spectrum is called the cepstrum, its
dimension quefrency, and the smoothing performed cepstral smoothing or liftering
[Oppenheim 1969; Oppenheim & Schafer 1975]. Figure 2.2a shows the spectrum in
Figure 2.1 after it has been cepstrally smoothed at a scale to emphasize the formants
and suppress the excitation. We shall see in Chapter 3 that this operation, in fact,
effectively separates excitation from transfer function in certain idealized, stationary
cases.

It is smoothing the power spectrum, not its logarithm, that most easily generalizes
to the non-stationary case later. We will therefore select our scales of interest by
smoothing the power spectrum instead, or equivalently, by windowing its fourier
transform, the autocorrelation function. Figure 2.2b shows the spectrum in Figure
2.1 after it has been thus smoothed.†
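
The equivalence just stated (convolving the power spectrum with a kernel versus windowing the autocorrelation function) can be checked numerically. The following is only a sketch under stated assumptions: it uses the discrete, circular form of the fourier transform, a random stand-in signal rather than speech, and an arbitrary gaussian lag window; all variable names are illustrative and do not come from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)                      # stand-in signal (illustrative only)

power = np.abs(np.fft.fft(x)) ** 2              # power spectrum
autocorr = np.fft.ifft(power).real              # circular autocorrelation

# Gaussian lag window; lags are 0, 1, ..., n/2-1, -n/2, ..., -1 (in samples).
lags = np.fft.fftfreq(n, d=1.0 / n)
lag_window = np.exp(-0.5 * (lags / 8.0) ** 2)   # width of 8 samples is arbitrary

# Route 1: window the autocorrelation, then transform back to frequency.
smoothed_by_window = np.fft.fft(autocorr * lag_window).real

# Route 2: circularly convolve the power spectrum with the window's transform.
kernel = np.fft.fft(lag_window).real / n
bins = np.arange(n)
smoothed_by_conv = np.array([np.sum(power * kernel[(k - bins) % n]) for k in bins])

print(np.allclose(smoothed_by_window, smoothed_by_conv))
```

A narrower lag window corresponds to a wider smoothing kernel in frequency, which is the scale-selection knob discussed in the text.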

What should the form of the convolution kernel in this smoothing operation be?
A desirable smoothing kernel would have good locality (or resolution) for a given
amount of smoothing. In other words, it would have small duration for the given
duration of its transform. These two durations are related by the uncertainty
principle: given a function $h(x)$ with fourier transform $H(s)$, if the variance of $|h(x)|^2$ is
$(\Delta x)^2$ and the variance of $|H(s)|^2$ is $(\Delta s)^2$, then $\Delta x\,\Delta s \ge \tfrac{1}{2}$, with $s$ measured
in radians [Bracewell 1975]. Marr & Hildreth [1980] proposed a gaussian smoothing
kernel (in a vision task) because it is the unique shape that meets the uncertainty
principle with equality.

† Empirically, power and log smoothing often produce similar results.
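
As a numerical illustration of the uncertainty relation above, the sketch below estimates the duration-bandwidth product for a gaussian and for a raised-cosine kernel. It assumes frequency is measured in radians (so the bound is 1/2) and approximates the fourier transform by an FFT on a fine grid; the function name `duration_bandwidth_product` and the grid sizes are hypothetical choices made for this example.

```python
import numpy as np

def duration_bandwidth_product(h, dx):
    """Std. dev. of |h|^2 times std. dev. of |H|^2 (frequency in radians)."""
    n = len(h)
    x = (np.arange(n) - n // 2) * dx
    p = np.abs(h) ** 2
    p = p / p.sum()
    var_x = np.sum((x - np.sum(x * p)) ** 2 * p)

    q = np.abs(np.fft.fft(h)) ** 2              # |H|^2 is unaffected by time shifts
    q = q / q.sum()
    w = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)   # angular frequency of each bin
    var_w = np.sum((w - np.sum(w * q)) ** 2 * q)
    return np.sqrt(var_x * var_w)

n = 2 ** 14
dx = 64.0 / n
x = (np.arange(n) - n // 2) * dx

gaussian = np.exp(-0.5 * x ** 2)
raised_cosine = np.where(np.abs(x) < 2.0, np.cos(np.pi * x / 4.0) ** 2, 0.0)

product_gauss = duration_bandwidth_product(gaussian, dx)
product_cosine = duration_bandwidth_product(raised_cosine, dx)
print(product_gauss, product_cosine)
```

Under these assumptions the gaussian sits at the bound, while the raised cosine lies slightly above it.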


2.2. The quasi-stationary case

The previous section examined the analysis of stationary speech signals. No real
speech signal, of course, is purely stationary. If the frequency content of a signal
varies slowly with time, however, there is a simple extension of the previous results.
The idea is to examine the signal over a short duration window. Given a signal $x(t)$
and a window $g(t)$, the short-time power spectrum at time $t$ is

    $$S_x(t,\omega) = \left|\int_{-\infty}^{\infty} x(\tau)\, g(\tau - t)\, e^{-i\omega\tau}\, d\tau\right|^2 .$$

Considered as a two-dimensional function of time and frequency, this signal
representation is called a spectrogram. Many different window shapes have been used;
they typically are symmetric, unimodal, and smooth, e.g., a gaussian or a raised
single period of a cosine.

Signals for which a window can be found whose duration is long enough to allow
adequate frequency resolution, but short enough to allow adequate time resolution
are called quasi-stationary. The example of the previous section was, in fact, a
quasi-stationary vowel. Virtually all speech analysis methods in the past depend
on the quasi-stationary assumption.
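
A short-time power spectrum of the kind described in this section can be sketched as follows. This is a minimal illustration, not the analysis code used for the figures: it assumes a sampled real signal, a gaussian window centered at the analysis time, and an FFT in place of the fourier integral. The function name, the 8 kHz sampling rate, and the test tone are all hypothetical; the 4 msec window width is borrowed from the value quoted later in the chapter.

```python
import numpy as np

def short_time_power_spectrum(x, fs, t_center, sigma):
    """|FFT of x(t) * gaussian window centered at t_center|^2.

    x        : real signal samples
    fs       : sampling rate in Hz
    t_center : analysis time in seconds
    sigma    : standard deviation of the gaussian window in seconds
    """
    t = np.arange(len(x)) / fs
    window = np.exp(-0.5 * ((t - t_center) / sigma) ** 2)
    spectrum = np.fft.rfft(x * window)
    freqs_hz = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs_hz, np.abs(spectrum) ** 2

# Illustrative input: a steady 1000 Hz tone, 0.1 s long.
fs = 8000.0
t = np.arange(int(0.1 * fs)) / fs
tone = np.cos(2.0 * np.pi * 1000.0 * t)

freqs_hz, power = short_time_power_spectrum(tone, fs, t_center=0.05, sigma=0.004)
peak_hz = freqs_hz[np.argmax(power)]
print(peak_hz)
```

Evaluating this function over a grid of `t_center` values and stacking the results column by column gives a spectrogram in the sense used here.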

2.3. Non-stationarity

There do exist signals for which no window duration is adequate. A very simple
such signal is the linear chirp, $e^{imt^2/2}$, whose instantaneous frequency increases
linearly with time. The quasi-stationary assumption breaks down for sufficiently large
modulation slope $m$ of the signal. Let us examine this claim.

By the uncertainty principle, the product of the time duration $\Delta t$ and the frequency
duration (bandwidth) $\Delta\omega$ of a window is bounded below by 1/2. The window
duration and bandwidth, in turn, determine the time and frequency resolution,
respectively, in the short-time spectra.† In other words, if the window duration
is too small, then the frequency resolution will be poor and if the window duration
is too long, the time resolution will be poor. Further, for a non-stationary signal,
poor time resolution can also mean poor frequency resolution since the frequency
content will have changed over the duration of the window, blurring the spectrum.

To illustrate these points, consider the short-time spectrum of a linear chirp, $e^{imt^2/2}$,
using a gaussian window, $e^{-t^2/2\sigma^2}$. We can measure the relative bandwidth of
the spectrum for different window sizes ($\sigma$'s) in terms of the standard deviation of
the spectrum ($\approx .42$ the half-power bandwidth), which is $\sqrt{(m^2\sigma^4 + 1)/2\sigma^2}$, where
the units are seconds and radians. Note that when $m \neq 0$, this grows without bound
as the window size becomes very small or very large. It has a minimum value of
$\sqrt{m}$, which occurs when the standard deviation of the gaussian is $1/\sqrt{m}$.
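
The bandwidth expression above can be checked numerically. The sketch below compares the closed-form standard deviation with one measured from an FFT of a sampled, gaussian-windowed chirp, and then searches a grid of window sizes for the minimum. The sampling rate, the time span, the slope of 40 Hz/msec (a value mentioned later in this section, converted here to radians per second squared), and the function names are assumptions made for illustration.

```python
import numpy as np

def predicted_bandwidth(m, sigma):
    """Closed-form std. dev. (rad/s) of the short-time power spectrum."""
    return np.sqrt((m ** 2 * sigma ** 4 + 1.0) / (2.0 * sigma ** 2))

def measured_bandwidth(m, sigma, fs=16000.0, half_span=0.1):
    """Std. dev. (rad/s) measured from an FFT of the windowed chirp."""
    t = np.arange(-half_span, half_span, 1.0 / fs)
    windowed = np.exp(0.5j * m * t ** 2) * np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    power = np.abs(np.fft.fft(windowed)) ** 2
    power = power / power.sum()
    omega = 2.0 * np.pi * np.fft.fftfreq(len(t), d=1.0 / fs)
    mean = np.sum(omega * power)
    return np.sqrt(np.sum((omega - mean) ** 2 * power))

m = 2.0 * np.pi * 40e3                       # 40 Hz/msec expressed in rad/s^2

# Grid search for the window size that minimizes the predicted bandwidth.
sigmas = np.geomspace(1e-4, 1e-1, 2001)
best_sigma = sigmas[np.argmin(predicted_bandwidth(m, sigmas))]

for sigma in (1e-3, 2e-3, 4e-3):
    print(sigma, predicted_bandwidth(m, sigma), measured_bandwidth(m, sigma))
print(best_sigma, 1.0 / np.sqrt(m))
```

The grid search should land close to $\sigma = 1/\sqrt{m}$, where the predicted bandwidth equals $\sqrt{m}$.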

We see from this that the minimum possible bandwidth of the short-time spectrum
of a chirp (using a gaussian window) grows with increasing modulation slope.
Figure 2.3 shows the short-time spectra of chirps of various modulation slopes using
windows that give the minimum bandwidth. For a slope of *0 Hz/msec, the chirp
peak has been broadened by several hundred Hz in the spectrum. The point here
is that, in theory, the usual quasi-stationary spectral analysis methods will give
poor resolution for sufficiently non-stationary signals. A few examples from natural
speech will show that such conditions arise in practice.


† This is made precise by Theorem II in Section 2.6.

Figure 2.3. Short-time spectra of linear chirps of several modulation slopes using
gaussian windows that give the minimum bandwidth. At the largest slope, the chirp
peak is significantly broadened.

Figure 2.4 shows cepstrally smoothed, short-time spectra of various /w/'s, uttered
first slowly and then increasingly rapidly. The spectrogram window used was a
gaussian of 4 msec standard deviation, which has an effective duration of about a
pitch period, the minimum duration that gives a reasonably stable spectral
estimate. The cepstral window is also chosen as brief as possible, while still removing
the harmonic peaks. Notice that the peak in the spectrum at about 1800 Hz,
corresponding to F2, grows in bandwidth with the increasing slope of F2 as seen in the
corresponding spectrograms in Figure 2.5. In case (c), where the F2 slope is about
40 Hz/msec, F2 is so broadened that its peak (i.e., the local maximum) is lost in
the short-time spectrum. Such an F2 slope is not uncommon for a /w/. In /j/'s, F2
can have large negative slopes, and in /r/ contexts, F3 can have very steep slopes;
see Figure 2.6. At consonant-vowel transitions, where the formant trajectories are
considered very important for stop consonant identification [Liberman et al. 1954],
the formant motion can also be very rapid; again see Figure 2.6.

It is worth noting that natural sounds other than the human voice can produce
non-stationary signals that are "chirped." For instance, bird song and bat cries
contain many rapid FM chirps [Greenewalt 1968; Marler 1979; Neuweiler 1977]. If
a sound source is in relative motion to the listener then Doppler effects can cause
large frequency shifts in the received signal across time [e.g., Dudgeon 1981].†
Glissandi of various musical instruments provide still more examples of signals that
contain rapidly time-varying spectral content.

It is also suggestive that neurophysiologists have found that a large population of
the auditory cells in the mammalian cochlear nucleus do not respond optimally to
continuous tones, but instead to sweep tones, with different populations responding
to different preferred modulation slopes ranging over ±15 Hz/msec [Møller 1978;
Britt & Starr 1976]. Further, psychophysical adaptation studies have shown similar
directional selectivity in the human auditory system [Kay & Matthews 1972; Regan
& Tinsley 1979].

† Some bats (the so-called CF bats) emit continuous tones, evidently depending on Doppler shifts for
echolocation.

Figure 2.4. Cepstrally smoothed, short-time spectra of /w/'s, uttered first very
slowly, then increasingly rapidly. In (c), F2 is so broadened by the analysis that its
peak (i.e., the local maximum) disappears. Cf. Figure 2.5.

Figure 2.5. Wide-band spectrograms of the /w/'s used in Figure 2.4. Note that
F2 remains clearly visible with increasing slope in the two-dimensional display.

The above comments are meant to call into question the validity of the
quasi-stationary assumption for speech and other auditory signals. We have seen that
speech is not always quasi-stationary, even in the sonorant regions. Assuming so
means that important features will be missed, having been blurred by the
analysis. It is interesting to note that, while the individual short-time spectra of the
non-stationary signals described above give a poor description of the signals, their
spectrograms are nevertheless quite legible. This is because when we look at a
spectrogram, we are not confined to examining it one-dimensionally along single
frequency slices, but instead we see a two-dimensional time and frequency surface.
In other words, time is not used as a parameter that varies over a family of spectra,
but as one of the intrinsic dimensions of the representation.

Figure 2.6. Spectrograms of rapid formant motion in various contexts. (a) /jti/.
(b) /ata/. (c) /bi/ in the context /tubi/. (d) /du/ in the context /tidu/.

I believe, in fact, that thinking of the initial speech processing as consisting of a
family of independent one-dimensional spectral analyses parameterized by time is
inappropriate. The problem should be thought of as a joint time-frequency analysis,
with the relationships and trade-offs between the two dimensions directly addressed,
which brings us to the next section.

2.4. Joint time-frequency representations

Various ways have been used to express signal energy as a joint function of time
and frequency. Certainly the most popular is the spectrogram of a signal $x(t)$ with
window $g(t)$,

    $$S_x(t,\omega) = \left|\int_{-\infty}^{\infty} x(\tau)\, g(\tau - t)\, e^{-i\omega\tau}\, d\tau\right|^2, \qquad (2.4.1)$$

which is just the short-time spectra described above displayed two-dimensionally.
The fact that the simultaneous time and frequency resolution in the spectrogram is
bounded by the uncertainty relation has led others to seek representations that do
not have this limitation.


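This is usually formulated in terms of the marginals (or projections) of the signal
representation $F_x(t,\omega)$,

    $$m_1(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F_x(t,\omega)\, d\omega, \qquad (2.4.2a)$$

    $$m_2(\omega) = \int_{-\infty}^{\infty} F_x(t,\omega)\, dt. \qquad (2.4.2b)$$

Perfect time and frequency resolution in this formulation requires that

    $$m_1(t) = |x(t)|^2 \quad \text{and} \quad m_2(\omega) = |X(\omega)|^2, \qquad (2.4.3)$$

where $X(\omega)$ is the fourier transform of $x(t)$.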

An example of a joint time-frequency representation that satisfies these
requirements is the Wigner distribution,

    $$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau}\, x(t + \tau/2)\, x^*(t - \tau/2)\, d\tau, \qquad (2.4.4)$$

which is currently quite popular in the signal processing literature [Claasen &
Mecklenbräuker 1980a,b,c].

The Wigner distribution of an impulse, $x(t) = \delta(t - t_0)$, is $W_x(t,\omega) = \delta(t - t_0)$, i.e.,
the signal energy is taken to lie on the vertical line $t = t_0$ in the time-frequency
plane. Similarly, for a complex exponential, $x(t) = e^{i\omega_0 t}$, the signal energy lies on
the horizontal line at $\omega = \omega_0$ ($W_x(t,\omega) = 2\pi\delta(\omega - \omega_0)$), and for a linear chirp,
$x(t) = e^{i(mt^2/2 + \omega_0 t)}$, the energy lies on the slanted line $\omega = mt + \omega_0$
($W_x(t,\omega) = 2\pi\delta(\omega - (\omega_0 + mt))$) (see Figure 2.7a).

In contrast, the spectrograms of these signals consist of broadened lines (see Figure
2.7b). There is, in fact, a simple relation between the spectrogram and the Wigner
distribution of a signal,

    $$S_x(t,\omega) = \frac{1}{2\pi}\, W_x(t,\omega) ** W_g(t,\omega), \qquad (2.4.5)$$

where $**$ denotes two-dimensional convolution and $W_g$ is the Wigner distribution of
the window [Claasen & Mecklenbräuker 1980c]. If $g(t) = e^{-t^2/2\sigma^2}$ is a gaussian,
then its Wigner distribution is also simple; it is just a two-dimensional gaussian,
$W_g(t,\omega) = 2\sqrt{\pi}\,\sigma\, e^{-t^2/\sigma^2} e^{-\sigma^2\omega^2}$. Thus, the two-dimensional convolution of the Wigner
distributions in Figure 2.7a by a two-dimensional gaussian will give the spectrograms
in Figure 2.7b.

Figure 2.7. Wigner distribution and spectrogram of some mono-component
signals. (a) The Wigner distribution resolves these as perfectly narrow lines in
the time-frequency plane. (b) The spectrogram is a smoothed version of the Wigner
distribution (e.g., if the spectrogram window is a gaussian, then the smoothing
kernel is a 2-D gaussian). The lines are broadened in this representation.

If the duration of the gaussian spectrogram window is decreased, then the 2-D
gaussian that, in essence, convolves the Wigner distribution to give the spectrogram
becomes narrower in time, but wider in frequency, and vice versa. It should be clear
from this example that the spectrogram does not meet the marginal requirement.

On the other hand, the Wigner distribution itself has some undesirable
properties. In particular, multi-component signals give rise to cross terms that cannot
be attributed much physical significance. For example, the Wigner distribution of
$x(t) = \cos\omega_0 t$ is $W_x(t,\omega) = \frac{\pi}{2}[\delta(\omega - \omega_0) + \delta(\omega + \omega_0) + 2\delta(\omega)\cos 2\omega_0 t]$ (see Figure
2.8a). The last term, which lies on a horizontal line at the frequency origin (varying
sinusoidally in amplitude), seems spurious. The spectrogram of $\cos\omega_0 t$, however, is
just two broadened lines at $\omega = \pm\omega_0$, which seems better behaved with respect to
superposition, since $\cos\omega_0 t = \frac{1}{2}(e^{i\omega_0 t} + e^{-i\omega_0 t})$ (see Figure 2.8b). The cross term is,
in effect, smoothed out by the convolution that transforms the Wigner distribution
into the spectrogram.
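
The cross term described above can be seen in a small discrete experiment. The sketch below evaluates a lag-truncated, discrete analog of the Wigner distribution at single time-frequency points; it is not the continuous definition, and the function name `wigner_point`, the lag range, and the test frequency are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def wigner_point(x, n, omega, half_lags):
    """Lag-truncated discrete Wigner value at sample n and frequency omega.

    Computes the real part of sum_m x[n+m] * conj(x[n-m]) * exp(-2j*omega*m),
    with omega in radians per sample and m running over -half_lags..half_lags.
    """
    m = np.arange(-half_lags, half_lags + 1)
    local_autocorr = x[n + m] * np.conj(x[n - m])
    return float(np.real(np.sum(local_autocorr * np.exp(-2j * omega * m))))

omega0 = np.pi / 4.0                         # illustrative tone frequency
x = np.cos(omega0 * np.arange(256))

# Auto term at omega = omega0: positive and nearly constant in time.
auto_a = wigner_point(x, 100, omega0, 50)
auto_b = wigner_point(x, 102, omega0, 50)

# Cross term at omega = 0: oscillates in time as cos(2*omega0*n), changing sign.
cross_a = wigner_point(x, 100, 0.0, 50)
cross_b = wigner_point(x, 102, 0.0, 50)

print(auto_a, auto_b, cross_a, cross_b)
```

In this sketch the zero-frequency values swing between positive and negative with roughly twice the magnitude of the auto terms, matching the factor of 2 on the cross term in the expression above. A spectrogram of the same signal is a squared magnitude and so cannot go negative.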

These examples illustrate that there are various (possibly conflicting) properties
that we might desire of a time-frequency representation, e.g., good time and
frequency resolution, and superposition for multicomponent signals. We shall, in fact,
approach the problem of choosing our time-frequency energy representation by first
specifying a set of desirable properties that the transform should satisfy, and then
deriving its form.

Figure 2.8. Wigner distribution and spectrogram for $\cos\omega_0 t$. (a) The Wigner
distribution of this signal has the 'spurious' cross term, proportional to $\delta(\omega)\cos 2\omega_0 t$, at the origin.
(b) The spectrogram does not show this term; it has been, in effect, smoothed out.

2.5. Design criteria for joint time-frequency representations

We will restrict the discussion to the quadratic transforms of the signal, which have
the form

    $$F_x(t,\omega) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h(\tau_1,\tau_2;t,\omega)\, x(\tau_1)\, x^*(\tau_2)\, d\tau_1\, d\tau_2, \qquad (2.5.1)$$

where $h(\tau_1,\tau_2;t,\omega)$ is an arbitrary function. This condition is imposed because it
results in a particularly manageable class, and because the representation of energy
as a quadratic function of the signal seems reasonable by analogy to other definitions
of energy. The class is quite large and includes many of the joint time-frequency
representations that have been previously proposed, such as the spectrograms, the
Wigner distribution, and the Rihaczek distribution [cf. Claasen & Mecklenbräuker
1980c].

From this class of representations, we seek ones that satisfy the following criteria:

(C1) Shift invariance: A shift in time or frequency of the signal should result in
a corresponding shift in time or frequency in the transform. Let $\hat{x}(t) = x(t - \tau)$ and
$\tilde{x}(t) = e^{i\nu t}x(t)$. Then we require $F_{\hat{x}}(t,\omega) = F_x(t - \tau,\omega)$ and $F_{\tilde{x}}(t,\omega) = F_x(t,\omega - \nu)$.

This property is desirable if we want to interpret the two dimensions of the transform
as time and frequency.

Transforms satisfying this condition can be put in the forms

    $$F_x(t,\omega) = \frac{1}{2\pi}\, \phi(t,\omega) ** W_x(t,\omega) \qquad (2.5.2)$$

and

    $$F_x(t,\omega) = \mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_x(\tau,\nu)], \qquad (2.5.3)$$

where $**$ denotes two-dimensional convolution, $W_x$ is the Wigner distribution

    $$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau}\, x(t + \tau/2)\, x^*(t - \tau/2)\, d\tau, \qquad (2.5.4)$$

$\phi(t,\omega)$ is an arbitrary kernel function, $\mathcal{F}$ is the 2-D fourier transform in the form
$\mathcal{F}[\phi(t,\omega)] = \frac{1}{2\pi}\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{-i\nu t + i\tau\omega}\, \phi(t,\omega)\, dt\, d\omega$, $\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)]$, and $A_x$ is the
time-frequency autocorrelation function†

    $$A_x(\tau,\nu) = \mathcal{F}[W_x(t,\omega)] = \int_{-\infty}^{\infty} e^{-i\nu t}\, x(t + \tau/2)\, x^*(t - \tau/2)\, dt \qquad (2.5.5)$$

for $x(t)$ [Claasen & Mecklenbräuker 1980c]. Note that for a spectrogram, $\phi$ is
the Wigner distribution of the spectrogram window, by Eq. 2.4.5 and Eq. 2.5.2.

(C2) Positivity: The signal energy at a given point in time and frequency should
be real and positive: $F_x(t,\omega) \ge 0$ for all $x$, $t$, and $\omega$. This seems appropriate
for interpreting the transform as an energy distribution. Some authors have argued
against the positivity requirement [e.g., Claasen & Mecklenbräuker 1980c]. We shall
examine the consequences of lifting this condition in the next section.

(C3) Superposition: The idea is that the time-frequency representation of a
multi-component signal should be a simple superposition of its components. The
straightforward linear formulation of this, i.e., $F_{x+y}(t,\omega) = F_x(t,\omega) + F_y(t,\omega)$,
however, is inconsistent with the quadratic nature of the transform and the
shift-invariance property C1. This apparent shortcoming is also true, for example,
of the spectrogram (Eq. 2.4.4). Nevertheless, we usually think of the conventional
spectrogram as being well-behaved under superposition. This is because
non-overlapping components do superimpose, i.e., $S_{x+y}(t,\omega) = S_x(t,\omega) + S_y(t,\omega)$
when $S_x(t,\omega)\, S_y(t,\omega) = 0$. There are no cross terms in this case. On the other hand,
the Wigner distribution does not have this property, suffering from cross terms to
which there cannot be attributed much physical significance.

† Some authors call this the ambiguity function [e.g., Claasen & Mecklenbräuker 1980a]; others reserve
this term for $|A_x(\tau,\nu)|^2$ [Van Trees 1968].
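The cross-term behaviour can be illustrated numerically. The sketch below is not from the original text: it assumes a circular, discrete-time analogue of Eq. 2.5.4 (the helper `wigner_point` is a hypothetical name) and evaluates it for the sum of two complex exponentials. Under these assumptions the value midway between the two components oscillates in sign and has twice the magnitude of either auto term.

```python
import cmath
import math

def wigner_point(x, n, k):
    """Circular discrete Wigner value at time index n and frequency bin k.

    Assumed discretization (not the thesis's): the lag tau/2 becomes an
    integer shift m, so the kernel is exp(-i * 4*pi*k*m / N).
    """
    N = len(x)
    total = 0j
    for m in range(N):
        total += (x[(n + m) % N] * x[(n - m) % N].conjugate()
                  * cmath.exp(-4j * math.pi * k * m / N))
    return total.real  # imaginary part vanishes by conjugate symmetry in m

N, k1, k2 = 64, 4, 12
x = [cmath.exp(2j * math.pi * k1 * n / N) + cmath.exp(2j * math.pi * k2 * n / N)
     for n in range(N)]

auto_term = wigner_point(x, 4, k1)                # energy at the first component
cross_term = wigner_point(x, 4, (k1 + k2) // 2)   # midway between the components
cross_term_later = wigner_point(x, 0, (k1 + k2) // 2)
```

At time index 4 the midway value is negative, while at index 0 it is positive, so no physical energy can be assigned to it.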

We shall require this property for our time-frequency representation, namely

$$F_{x+y}(t,\omega) = F_x(t,\omega) + F_y(t,\omega) \quad \text{when} \quad F_x(t,\omega)\, F_y(t,\omega) = 0. \qquad (2.5.6a)$$

More generally, we would like $F_{x+y}(t,\omega) \approx F_x(t,\omega) + F_y(t,\omega)$ when $F_x(t,\omega)\, F_y(t,\omega) \approx 0$.
Stated more precisely, we require that for any $\epsilon > 0$, there exists a $\delta > 0$ such that

$$|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]| < \epsilon \quad \text{when} \quad |F_x(t,\omega)\, F_y(t,\omega)| < \delta. \qquad (2.5.6b)$$

(C4) Locality: Signal energy that is localized in time-frequency should remain
localized in time-frequency in the transform. The advantage of the Wigner
distribution is that it is perfectly localized according to various criteria, such as
preserving the marginal distributions (Eq. 2.4.3) and the finite support properties
[see Claasen & Mecklenbräuker 1980a].† The Wigner distribution, however, does not satisfy
the positivity (C2) or superposition (C3) properties, as indicated earlier. In fact,
positivity (and thus, as we shall see, superposition) is inconsistent with the time and
frequency marginal conditions [Claasen & Mecklenbräuker 1980c]. Fortunately, for
our purposes, we do not require perfect locality, so we can relax the above conditions
somewhat.

† The finite support property states that if a signal has finite extent in time or frequency then its
representation will have the same extent in the corresponding variable.

From Eq. 2.5.2, the transform kernel $\phi(t,\omega)$ can be viewed as the point spread
function on the perfectly localized Wigner distribution. We can therefore measure
the locality of the transform in time and frequency in terms of the variances†

$$\sigma_t^2 = \frac{\iint t^2\, |\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega} \qquad (2.5.7a)$$

and

$$\sigma_\omega^2 = \frac{\iint \omega^2\, |\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega}, \qquad (2.5.7b)$$

where we assume that the center of mass of $|\phi(t,\omega)|^2$ is at the origin.‡

In general, these two measures are not enough; an additional locality measure is
important, the covariance

$$\sigma_{t\omega} = \frac{\iint t\,\omega\, |\phi(t,\omega)|^2\, dt\, d\omega}{\iint |\phi(t,\omega)|^2\, dt\, d\omega}. \qquad (2.5.7c)$$

Together, $\sigma_t$, $\sigma_\omega$, and $\sigma_{t\omega}$ determine the covariance matrix and the associated
concentration ellipse in the $(t,\omega)$ plane,


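$$\begin{pmatrix} t & \omega \end{pmatrix} \begin{pmatrix} \sigma_t^2 & \sigma_{t\omega} \\ \sigma_{t\omega} & \sigma_\omega^2 \end{pmatrix}^{-1} \begin{pmatrix} t \\ \omega \end{pmatrix} = 1. \qquad (2.5.8)$$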


When $\sigma_{t\omega} = 0$, the major and minor axes of the concentration ellipse coincide with
the time and frequency axes (Figure 2.9a). More generally, the concentration ellipse
may be oriented obliquely relative to the coordinate axes (Figure 2.9b). We shall
call transforms that satisfy the condition $\sigma_{t\omega} = 0$ on their kernel non-directionally
localized. This name is appropriate since we can rescale the coordinate axes to
make the concentration ellipse a circle under this condition. Thus viewed, the
transform spreads energy uniformly in all directions in time-frequency. On the other
hand, if $\sigma_{t\omega} \ne 0$, then this does not hold, and the transform will be directionally
localized, always having better resolution in some time-frequency directions than
others regardless of the scaling of the axes.

† The generality of this approach depends on the Wigner distribution uniquely satisfying perfect
locality. Cohen has shown that a quadratic transform that satisfies the shift-invariance property
(C1) will meet the time and frequency marginal conditions (Eq. 2.4.3) if $\Phi(\tau,0) = 1$ for all $\tau$ and
$\Phi(0,\nu) = 1$ for all $\nu$. These marginal conditions essentially guarantee that an impulse and a complex
exponential are not 'blurred' by the time-frequency representation, but are not strong enough to
also guarantee that a linear chirp is not 'blurred' (see Figure 2.7a). This additional condition is
met uniquely by the Wigner distribution. In other words, we interpret perfect locality to mean that
the signal transform does not spread the signal energy in any direction in time-frequency (not just
the horizontal and vertical directions). We postpone a more thorough discussion of this point until
Section 2.8, when the necessary mathematical machinery will be introduced.

‡ This assumption is not very restrictive on the form of the transform, since we can always shift $\phi(t,\omega)$
in time and frequency to satisfy it. This shift, in turn, shifts the transform in time and frequency.

Figure 2.9. Concentration ellipses for transform kernels. (a) Non-directional kernel
($\sigma_{t\omega} = 0$); the coordinate axes can be rescaled to make the concentration ellipse a
circle. Thus viewed, the corresponding transform spreads the signal energy equally
in all time-frequency directions. (b) Directional kernel ($\sigma_{t\omega} \ne 0$); the coordinate
axes cannot be rescaled to make the concentration ellipse a circle. The corresponding
transform always has better resolution in some time-frequency directions than
others.
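As an illustration of the locality measures in Eqs. 2.5.7a-c, the following sketch estimates the variances and covariance of $|\phi(t,\omega)|^2$ on a uniform grid. It is only a numerical aid, not part of the original text: the function name `kernel_moments`, the two example kernels, and the grid are assumptions chosen for the example.

```python
import math

def kernel_moments(phi, ts, ws):
    """Estimate (sigma_t^2, sigma_w^2, sigma_tw) of |phi|^2 on a uniform grid.

    The grid spacing cancels in the ratios of Eqs. 2.5.7a-c, so plain sums
    are used. The center of mass is removed explicitly.
    """
    norm = m_t = m_w = m_tt = m_ww = m_tw = 0.0
    for t in ts:
        for w in ws:
            p = abs(phi(t, w)) ** 2
            norm += p
            m_t += t * p
            m_w += w * p
            m_tt += t * t * p
            m_ww += w * w * p
            m_tw += t * w * p
    mean_t, mean_w = m_t / norm, m_w / norm
    return (m_tt / norm - mean_t ** 2,
            m_ww / norm - mean_w ** 2,
            m_tw / norm - mean_t * mean_w)

grid = [0.1 * i for i in range(-80, 81)]

# Hypothetical non-directional kernel with sigma_t = 1.0 and sigma_w = 0.5.
aligned = lambda t, w: math.exp(-t * t / (4 * 1.0 ** 2) - w * w / (4 * 0.5 ** 2))
# Hypothetical directional kernel: the t*w term tilts the concentration ellipse.
oblique = lambda t, w: math.exp(-(t * t - t * w + w * w))

aligned_moments = kernel_moments(aligned, grid, grid)
oblique_moments = kernel_moments(oblique, grid, grid)
```

For the first kernel the covariance vanishes, so it is non-directionally localized; for the second the covariance is 1/6 and no rescaling of the axes turns its concentration ellipse into a circle.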

The analysis of the non-directional transforms is more straightforward. We therefore
restrict our attention to this case until Section 2.8, when we shall examine the
more general case. We will see there that the principal results are essentially the
same as in the non-directional case, suitably generalized. The analysis, however, is more
complex, and is thus best left until later.

To summarize, given a non-directional transform ($\sigma_{t\omega} = 0$), $\sigma_t$ and $\sigma_\omega$ measure its
degree of locality in time and frequency. The smaller $\sigma_t$ and $\sigma_\omega$ are, the better the
time and frequency resolution.


(C5) Smoothness: Similar to the stationary case, different aspects of the speech
signal can arise at different scales in time-frequency. For example, voiced excitation
can give rise to fine scale structure on the order of the pitch period in the time
dimension and the fundamental frequency in the frequency dimension. The formant
structure, on the other hand, arises at a somewhat larger scale. Thus, one of
the design parameters for our transform is the scale in time-frequency we wish to
examine. Said differently, we want the transform to be smooth in time-frequency
to a given degree.


This notion of scale can be formalized by measuring the distribution of the spatial
frequencies present in $F_x(t,\omega)$, i.e., the distribution of energy about the origin of its
2-D Fourier transform. Since $\mathcal{F}[F_x(t,\omega)] = \Phi(\tau,\nu)\, A_x(\tau,\nu)$ (Eq. 2.5.3), the relative
amount of spread is determined by the choice of $\Phi(\tau,\nu)$, which windows the
time-frequency autocorrelation function. We can measure this spread in terms of the
variances

$$\Sigma_\tau^2 = \frac{\iint \tau^2\, |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}, \qquad (2.5.9a)$$

$$\Sigma_\nu^2 = \frac{\iint \nu^2\, |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}, \qquad (2.5.9b)$$

and

$$\Sigma_{\tau\nu} = \frac{\iint \tau\,\nu\, |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu}, \qquad (2.5.9c)$$

where we assume that the center of mass of $|\Phi(\tau,\nu)|^2$ is at the origin.† These
determine the covariance matrix and the associated concentration ellipse in the
$(\tau,\nu)$ plane,


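$$\begin{pmatrix} \tau & \nu \end{pmatrix} \begin{pmatrix} \Sigma_\tau^2 & \Sigma_{\tau\nu} \\ \Sigma_{\tau\nu} & \Sigma_\nu^2 \end{pmatrix}^{-1} \begin{pmatrix} \tau \\ \nu \end{pmatrix} = 1. \qquad (2.5.10)$$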


When $\Sigma_{\tau\nu} = 0$, we call the transform non-directionally smooth. In this case, it is
possible to rescale the coordinate axes to make the concentration ellipse a circle,
and thus viewed the transform smooths the signal uniformly in
all directions in time-frequency. On the other hand, if $\Sigma_{\tau\nu} \ne 0$, then this does not
hold, and the transform will be directionally smooth, always smoothing more in
some time-frequency directions than others regardless of the scaling of the axes.
Just as for the locality condition, we will restrict attention now to the non-directional
transforms. We consider the more general case in Section 2.8.

To summarize, given a non-directional transform ($\Sigma_{\tau\nu} = 0$), $\Sigma_\tau$ and $\Sigma_\nu$ measure its
scale in time and frequency. The smaller $\Sigma_\tau$ and $\Sigma_\nu$ are, the larger the selected
scales.


† This assumption will be true if the transform is real.

Observe at this point the parallels between the stationary and non-stationary
analyses. If we think of the Wigner distribution as the non-stationary analog to the raw
power spectrum, then the time-frequency autocorrelation function (the Wigner
distribution's 2-D Fourier transform) is the 2-D analog to the autocorrelation function
(the power spectrum's Fourier transform). Further, windowing the time-frequency
autocorrelation function smooths the Wigner distribution, just as windowing the
autocorrelation smooths the raw spectrum. In both cases, the design decisions
for the resulting transform require selecting a convolution kernel that satisfies both
locality and smoothness requirements. In fact, we shall see in the next chapter that
the analogy is even closer.

2.6. Relations among the design criteria

The various design criteria for our time-frequency energy representation are not
independent. We shall state the important relationships among them in this section.
Throughout this section we assume that the input signal $x(t)$ is finite energy, i.e.,
$x \in L_2$, and that $F_x(t,\omega)$ is a quadratic transform of the signal. This means that
$F_x(t,\omega) = \langle T_{t,\omega}\, x, x \rangle$, where $\langle x, y \rangle = \int x(\alpha)\, y^*(\alpha)\, d\alpha$ and $T_{t,\omega}$ is a (bounded) linear
operator on $L_2$.

• Shift-Invariance & Positivity: Together these imply that the transform can
be expressed as a superposition of spectrograms.†

Theorem A. Let $F_x(t,\omega)$ be positive and shift-invariant. Then it has the form

$$F_x(t,\omega) = \int_{-\infty}^{\infty} S_x(t,\omega; g_\alpha)\, d\alpha, \qquad (2.6.1)$$

where $S_x(t,\omega; g)$ is the spectrogram having $g$ as its window.

Proof: The positivity of $F_x(t,\omega)$ means that $T_{t,\omega}$ is a positive operator and therefore
has a square root $A$, i.e.,

$$F_x = \langle Tx, x \rangle = \langle A^*Ax, x \rangle = \langle Ax, Ax \rangle = \|Ax\|^2, \qquad (A.1)$$

where $\|x(\alpha)\|^2 = \int |x(\alpha)|^2\, d\alpha$ [see Rudin 1973]. Representing the linear operator $A$
in terms of its impulse response, $A[x(\alpha)] = \int h(\alpha,\tau;t,\omega)\, x(\tau)\, d\tau$, and substituting
into Eq. A.1 gives

$$F_x(t,\omega) = \int_{-\infty}^{\infty} \left| \int_{-\infty}^{\infty} h(\alpha,\tau;t,\omega)\, x(\tau)\, d\tau \right|^2 d\alpha.$$

By time and frequency shift-invariance,

$$F_x(t + a, \omega + p) = \int_{-\infty}^{\infty} \left| \int_{-\infty}^{\infty} h(\alpha,\tau;t,\omega)\, x(\tau + a)\, e^{-ip(\tau + a)}\, d\tau \right|^2 d\alpha. \qquad (A.2)$$

Setting $t = \omega = 0$ gives

$$F_x(a, p) = \int_{-\infty}^{\infty} \left| \int_{-\infty}^{\infty} h(\alpha,\tau;0,0)\, x(\tau + a)\, e^{-ip(\tau + a)}\, d\tau \right|^2 d\alpha,$$

or, with $g_\alpha(\tau) = h(\alpha,\tau;0,0)$,

$$F_x(t,\omega) = \int_{-\infty}^{\infty} \left| \int_{-\infty}^{\infty} g_\alpha(\tau)\, x(\tau + t)\, e^{-i\omega(\tau + t)}\, d\tau \right|^2 d\alpha.$$

From Eq. 2.4.1, we see the outer integrand is the spectrogram $S_x(t,\omega; g_\alpha)$, giving
Eq. 2.6.1. ///

† Bouachache et al. [1979] incorrectly state that a positive and shift-invariant quadratic transform is
necessarily a spectrogram. Claasen & Mecklenbräuker [1984] point out this error, mentioning that
linear combinations of spectrograms must be included.


• Positivity & Superposition: The next theorem shows that positivity implies
superposition. In fact, it implies a strong form of superposition, as in Eq. 2.5.6b.

Theorem B. If $F_x(t,\omega)$ is positive, then

$$|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]|^2 \le 4\, F_x(t,\omega)\, F_y(t,\omega). \qquad (2.6.2)$$

Proof: From the elementary fact about inner products

$$\|p + q\|^2 = \|p\|^2 + 2\,\mathrm{Re}\langle p, q \rangle + \|q\|^2$$

it follows that

$$\left| \|p + q\|^2 - [\|p\|^2 + \|q\|^2] \right|^2 = 4\, |\mathrm{Re}\langle p, q \rangle|^2 \le 4\, |\langle p, q \rangle|^2.$$

Since $|\langle p, q \rangle| \le \|p\|\, \|q\|$,

$$\left| \|p + q\|^2 - [\|p\|^2 + \|q\|^2] \right|^2 \le 4\, \|p\|^2\, \|q\|^2.$$

Substituting $p = Ax$ and $q = Ay$ above and using Eq. A.1 gives Eq. 2.6.2. ///
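The bound of Theorem B can be spot-checked for the simplest positive transform, a spectrogram. The sketch below is an illustration only: it assumes a discrete-time signal, a gaussian window of 4 samples standard deviation, and random complex test signals, none of which come from the original text. It relies on the linearity of the windowed Fourier coefficient, so the coefficient of $x + y$ is the sum of the two coefficients.

```python
import cmath
import math
import random

def stft_point(x, t0, w, s=4.0):
    """Gaussian-windowed Fourier coefficient of x at sample t0, frequency w (rad/sample)."""
    return sum(x[n] * math.exp(-((n - t0) ** 2) / (2 * s * s)) * cmath.exp(-1j * w * n)
               for n in range(len(x)))

def superposition_gap(x, y, t0, w):
    """Return (|S_xy - (S_x + S_y)|^2, 4 * S_x * S_y) at one time-frequency point."""
    cx, cy = stft_point(x, t0, w), stft_point(y, t0, w)
    sx, sy = abs(cx) ** 2, abs(cy) ** 2
    sxy = abs(cx + cy) ** 2  # spectrogram of x + y, by linearity of the coefficient
    return (sxy - (sx + sy)) ** 2, 4 * sx * sy

random.seed(0)
x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(64)]
y = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(64)]

pairs = [superposition_gap(x, y, t0, 0.2 * k)
         for t0 in range(8, 56, 8) for k in range(16)]
# Left side minus right side of Eq. 2.6.2; it should never be positive.
worst_excess = max(lhs - rhs for lhs, rhs in pairs)
```

The deviation from perfect superposition is therefore small wherever either spectrogram is small, which is the strong form of superposition in Eq. 2.5.6b.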

If the transform is real, the converse of this theorem is also true; i.e., superposition
implies either $F_x$ or $-F_x$ is positive.

Theorem C. Let $F_x(t,\omega)$ be real and satisfy superposition (Eq. 2.5.6a). Then
either $F_x(t,\omega)$ or $-F_x(t,\omega)$ is positive.

Proof: Step 1. First we show that under the hypotheses of the theorem $\langle Tx, x \rangle = 0 \Rightarrow Tx = 0$.

Superposition says

$$\langle Tx, x \rangle \langle Ty, y \rangle = 0 \;\Rightarrow\; \langle T(x + y), x + y \rangle = \langle Tx, x \rangle + \langle Ty, y \rangle. \qquad (C.1)$$

Since the form $\langle Tx, x \rangle$ is always real, $\langle Tx, y \rangle = \langle Ty, x \rangle^*$, so

$$\langle T(x + y), x + y \rangle = \langle Tx, x \rangle + 2\,\mathrm{Re}\langle Tx, y \rangle + \langle Ty, y \rangle.$$

Thus, from Eq. C.1,

$$\langle Tx, x \rangle \langle Ty, y \rangle = 0 \;\Rightarrow\; \mathrm{Re}\langle Tx, y \rangle = 0. \qquad (C.2)$$

Substituting $ix$ into Eq. C.2 shows that $\mathrm{Im}\langle Tx, y \rangle = 0$ also, so that

$$\langle Tx, x \rangle \langle Ty, y \rangle = 0 \;\Rightarrow\; \langle Tx, y \rangle = 0. \qquad (C.3)$$

Suppose that $\langle Tx, x \rangle = 0$. Then by Eq. C.3, $\langle Tx, y \rangle = 0$ for all $y$. If we let $y = Tx$,
then $\langle Tx, Tx \rangle = 0$ and thus $Tx = 0$, as desired.

Step 2. We now show that $\langle Tz, z \rangle = 0 \Rightarrow Tz = 0$ implies $\pm T$ is positive. Suppose
$\langle Tx, x \rangle > 0$ and $\langle Ty, y \rangle < 0$. Let $z = kx + y$ where $k$ is real. Then

$$\langle Tz, z \rangle = k^2 \langle Tx, x \rangle + 2k\,\mathrm{Re}\langle Tx, y \rangle + \langle Ty, y \rangle.$$

This is a quadratic in $k$, and since $\langle Ty, y \rangle < 0 < \langle Tx, x \rangle$, it has two distinct real zeros.
However, since $Tx \ne 0$, $Tz = kTx + Ty$ has at most one zero in $k$. Therefore, there
exists a value of $k$ such that $\langle Tz, z \rangle = 0$ but $Tz \ne 0$, contradicting the hypothesis,
and implying $\pm T$ is positive. ///

This last theorem shows that we can replace the positivity condition (C2) with the
sole requirement that the transform be always real, and have an equivalent set of
properties. In other words, the transform will necessarily be positive if superposition
holds, and if positivity is abandoned, cross terms will necessarily prove a problem
for multi-component signals such as speech.

• Positivity & Locality: The positivity condition places a limit on the
time-frequency locality of the transform. When the transform is positive, it is
sometimes convenient to measure locality in terms of the variances of $\phi(t,\omega)$ instead of
$|\phi(t,\omega)|^2$. We define

$$\sigma_T^2 = \frac{\iint t^2\, \phi(t,\omega)\, dt\, d\omega}{\iint \phi(t,\omega)\, dt\, d\omega} \qquad (2.6.3a)$$

and

$$\sigma_\Omega^2 = \frac{\iint \omega^2\, \phi(t,\omega)\, dt\, d\omega}{\iint \phi(t,\omega)\, dt\, d\omega}, \qquad (2.6.3b)$$

where we assume that the center of mass of $\phi(t,\omega)$ is at the origin.† When the
transform is positive, we claim that these variances are non-negative. To show this,
first suppose the transform is a spectrogram. Then $\phi(t,\omega)$ is the Wigner distribution
of the spectrogram window $g(t)$, and using Eq. 2.4.3, it is easy to see that

$$\sigma_T^2 = \mathrm{var}\,|g(t)|^2 \quad \text{and} \quad \sigma_\Omega^2 = \mathrm{var}\,|G(\omega)|^2, \qquad (2.6.4)$$

which are clearly non-negative [cf. De Bruijn]. More generally, if the transform is
positive, it follows directly from Theorem A that

$$\sigma_T^2 = \int c_\alpha\, \mathrm{var}\,|g_\alpha(t)|^2\, d\alpha \quad \text{and} \quad \sigma_\Omega^2 = \int c_\alpha\, \mathrm{var}\,|G_\alpha(\omega)|^2\, d\alpha, \qquad (2.6.5)$$

where

$$c_\alpha = \frac{\int_{-\infty}^{\infty} |g_\alpha(t)|^2\, dt}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |g_\alpha(t)|^2\, dt\, d\alpha}. \qquad (2.6.6)$$

These are again non-negative quantities.

Eq. 2.6.5 shows that $\sigma_T^2$ is the (weighted) average window variance in the
representation of $F_x(t,\omega)$ as a superposition of spectrograms. Since a spectrogram's values
at a given time depend only on signal values under its window, we see that a positive
transform at a time $t$ effectively depends only on signal values within a few $\sigma_T$ of $t$.‡

† This assumption is necessary for the term 'variance' to apply. It is not necessary, however, for the
uncertainty relations presented below to be true [cf. De Bruijn].

‡ This is a stronger notion of time locality than in the previous section. There, time locality essentially
measured how the transform spread an impulse. The Wigner distribution is perfectly localized in
this sense, because it represents the energy of an impulse at time $t_0$ entirely on the vertical line
$t = t_0$ in the time-frequency plane. This does not mean that the Wigner distribution's values at
time $t_0$ depend only on the signal value at $t_0$. Quite the opposite is true; they depend on the entire
signal. (In fact, the signal can be recovered from the Wigner distribution's values at any fixed time
$t_0$ (up to a multiplicative constant) [see Claasen & Mecklenbräuker 1980a].) However, when the
transform is positive these two notions of locality coincide.

The next theorem states an important uncertainty relation for positive transforms.
It bounds the simultaneous time and frequency resolution that can be obtained by
such a transform.

Theorem D. Let $F_x(t,\omega)$ be positive and shift-invariant. Then $\sigma_T \sigma_\Omega \ge \frac{1}{2}$.

Proof: From Eq. 2.6.5,

$$\sigma_T^2\, \sigma_\Omega^2 = \int c_\alpha\, \sigma_\alpha^2\, d\alpha \int c_\alpha\, \Sigma_\alpha^2\, d\alpha,$$

where $\sigma_\alpha^2 = \mathrm{var}\,|g_\alpha(t)|^2$ and $\Sigma_\alpha^2 = \mathrm{var}\,|G_\alpha(\omega)|^2$. By the Schwarz Inequality,

$$\sigma_T^2\, \sigma_\Omega^2 \ge \left( \int c_\alpha\, \sigma_\alpha\, \Sigma_\alpha\, d\alpha \right)^2.$$

The classical uncertainty relation applied to $g_\alpha(t)$ gives $\sigma_\alpha \Sigma_\alpha \ge \frac{1}{2}$, so

$$\sigma_T^2\, \sigma_\Omega^2 \ge \left( \frac{1}{2} \int c_\alpha\, d\alpha \right)^2 = \frac{1}{4},$$

since $\int c_\alpha\, d\alpha = 1$ from Eq. 2.6.6. Taking square roots yields the desired result. ///
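The single-window case of this bound (Eq. 2.6.4 with the classical uncertainty relation) can be checked numerically. The sketch below is illustrative and makes several assumptions not found in the original text: the helper name `window_spreads`, a real window sampled on a uniform grid, and the use of Parseval's relation so that the frequency variance of $|G(\omega)|^2$ is computed as the energy of the derivative of $g$ divided by the energy of $g$.

```python
import math

def window_spreads(g, ts):
    """Standard deviations of |g(t)|^2 and |G(w)|^2 for a real window g.

    For a real window the mean frequency is zero, so by Parseval the
    frequency variance equals integral |g'|^2 / integral |g|^2.
    """
    dt = ts[1] - ts[0]
    vals = [g(t) for t in ts]
    energy = sum(v * v for v in vals)
    mean_t = sum(t * v * v for t, v in zip(ts, vals)) / energy
    var_t = sum((t - mean_t) ** 2 * v * v for t, v in zip(ts, vals)) / energy
    # Central differences approximate g'(t) at the interior grid points.
    deriv = [(vals[i + 1] - vals[i - 1]) / (2 * dt) for i in range(1, len(vals) - 1)]
    var_w = sum(d * d for d in deriv) / energy
    return math.sqrt(var_t), math.sqrt(var_w)

ts = [0.01 * i for i in range(-1000, 1001)]
gauss = lambda t: math.exp(-t * t / 2.0)
hann = lambda t: math.cos(t) ** 2 if abs(t) < math.pi / 2 else 0.0

st_g, sw_g = window_spreads(gauss, ts)
st_h, sw_h = window_spreads(hann, ts)
gauss_product = st_g * sw_g   # close to the lower bound 1/2
hann_product = st_h * sw_h    # strictly above 1/2
```

Under these assumptions the gaussian window sits at the bound, while the raised-cosine window exceeds it slightly, consistent with Theorem D applied to a single spectrogram.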

• Locality & Smoothness: Just as in the stationary case, locality and smoothness
are conflicting properties. Greater smoothness means poorer locality and vice
versa, other things being equal. This follows formally from a two-dimensional
generalization of the classical uncertainty relation.

Theorem E. If $F_x(t,\omega)$ is shift-invariant, then $\sigma_t \Sigma_\nu \ge \frac{1}{2}$ and $\sigma_\omega \Sigma_\tau \ge \frac{1}{2}$, with
equality in both these relations iff

$$\phi(t,\omega) \propto e^{-t^2/4\sigma_t^2}\, e^{-\omega^2/4\sigma_\omega^2}. \qquad (2.6.7)$$

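Proof: First, we show that $\sigma_t \Sigma_\nu \ge \frac{1}{2}$. Let $A(t,\tau) = \frac{1}{2\pi} \int \phi(t,\omega)\, e^{i\omega\tau}\, d\omega$. Then
$\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)] = \int A(t,\tau)\, e^{-i\nu t}\, dt$. Applying the classical uncertainty relation
to $A(t,\tau)$ w.r.t. $t$ gives

$$\frac{1}{4} \left( \int_{-\infty}^{\infty} |A(t,\tau)|^2\, dt \right)^2 \le \int_{-\infty}^{\infty} t^2\, |A(t,\tau)|^2\, dt \;\cdot\; \frac{1}{2\pi} \int_{-\infty}^{\infty} \nu^2\, |\Phi(\tau,\nu)|^2\, d\nu. \qquad (E.1)$$

Taking the square root of Eq. E.1, integrating over $\tau$, and using the Schwarz Inequality,

$$\frac{1}{2} \int\!\!\int |A(t,\tau)|^2\, dt\, d\tau \le \int \left( \int t^2\, |A(t,\tau)|^2\, dt \;\cdot\; \frac{1}{2\pi} \int \nu^2\, |\Phi(\tau,\nu)|^2\, d\nu \right)^{1/2} d\tau$$

$$\le \left( \int\!\!\int t^2\, |A(t,\tau)|^2\, dt\, d\tau \;\cdot\; \frac{1}{2\pi} \int\!\!\int \nu^2\, |\Phi(\tau,\nu)|^2\, d\nu\, d\tau \right)^{1/2}. \qquad (E.2)$$

By Parseval's theorem,

$$\int_{-\infty}^{\infty} |A(t,\tau)|^2\, d\tau = \frac{1}{2\pi} \int_{-\infty}^{\infty} |\phi(t,\omega)|^2\, d\omega \qquad (E.3a)$$

and

$$\int_{-\infty}^{\infty} |A(t,\tau)|^2\, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} |\Phi(\tau,\nu)|^2\, d\nu. \qquad (E.3b)$$

Substituting Eq. E.3 into Eq. E.2 yields

$$\frac{1}{2} \int\!\!\int |\phi(t,\omega)|^2\, dt\, d\omega \le \left( \int\!\!\int t^2\, |\phi(t,\omega)|^2\, dt\, d\omega \;\cdot\; \int\!\!\int \nu^2\, |\Phi(\tau,\nu)|^2\, d\nu\, d\tau \right)^{1/2}. \qquad (E.4)$$

Since $\iint |\phi(t,\omega)|^2\, dt\, d\omega = \iint |\Phi(\tau,\nu)|^2\, d\tau\, d\nu$, we have $\frac{1}{4} \le \sigma_t^2\, \Sigma_\nu^2$. By similar
reasoning, $\frac{1}{4} \le \sigma_\omega^2\, \Sigma_\tau^2$.

Direct computation of the variances shows that if $\phi(t,\omega)$ is a 2-D gaussian (Eq. 2.6.7),
then these inequalities are satisfied with equality. Showing the converse is somewhat
more involved. If these inequalities are satisfied with equality, then from the
classical uncertainty relation and the proof above, it follows that $\Phi(\tau,\nu)$ is gaussian
in each of its variables. In other words,

$$\Phi(\tau,\nu) = e^{-[a(\nu)\tau^2 + b(\nu)]} = e^{-[c(\tau)\nu^2 + d(\tau)]}$$

for all $\tau$ and $\nu$, where $a > 0$ and $c > 0$. Thus $a(\nu)\tau^2 + b(\nu) = c(\tau)\nu^2 + d(\tau)$.
Setting $\tau = 0$ and $\nu = 0$ shows that $b(\nu) = c(0)\nu^2 + d(0)$ and $d(\tau) = a(0)\tau^2 + b(0)$,
respectively, so

$$a(\nu)\tau^2 + c(0)\nu^2 + d(0) = c(\tau)\nu^2 + a(0)\tau^2 + b(0). \qquad (E.6)$$

Twice differentiating this w.r.t. $\tau$ and $\nu$ gives $a''(\nu) = c''(\tau)$ for all $\tau$ and $\nu$, and thus
they are constant. Taylor expanding $a(\nu)$ and $c(\tau)$, substituting into Eq. E.6, and
equating terms shows that

$$\Phi(\tau,\nu) = e^{-[a(0)\tau^2 + \frac{1}{2}a''(0)\tau^2\nu^2 + c(0)\nu^2 + d(0)]}. \qquad (E.7)$$

By the symmetry of the two domains, $\phi(t,\omega)$ must have the same form, say with
exponent $\alpha_1 t^2 + \gamma_1 t^2\omega^2 + \beta_1 \omega^2 + \delta_1$; write the exponent of Eq. E.7 as
$\alpha_2 \tau^2 + \gamma_2 \tau^2\nu^2 + \beta_2 \nu^2 + \delta_2$. Together, these imply that

$$A(t,\tau) = \frac{e^{-[\alpha_1 t^2 + \tau^2/4(\beta_1 + \gamma_1 t^2) + \delta_1]}}{2\sqrt{\pi(\beta_1 + \gamma_1 t^2)}} = \frac{e^{-[\alpha_2 \tau^2 + t^2/4(\beta_2 + \gamma_2 \tau^2) + \delta_2]}}{2\sqrt{\pi(\beta_2 + \gamma_2 \tau^2)}} \qquad (E.8)$$

for all $t$ and $\tau$. Taking the logarithm of Eq. E.8, clearing of fractions, and equating
terms shows that $\gamma_1 = \gamma_2 = 0$. Thus, $a''(0) = 0$ in Eq. E.7, which implies Eq. 2.6.7,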


as desired. ///

2.7. Satisfying the design criteria: the Gaussian transform

From the last theorem, we see that a two-dimensional gaussian transform kernel
gives the best time-frequency locality for a given smoothness. The resulting
representation will be called the Gaussian transform of the signal.† By specifying $\sigma_t$
($= 1/2\Sigma_\nu$) and $\sigma_\omega$ ($= 1/2\Sigma_\tau$) for this kernel we are, in effect, selecting a particular time
and frequency scale for the transform. We may choose any values we wish provided
$\sigma_T \sigma_\Omega \ge \frac{1}{2}$ (positivity), and the resulting transform will best satisfy all our design
properties. The result is clearly a generalization of the solution in the stationary
case, where a gaussian convolution kernel of different sizes selected different spectral
scales.
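As a worked check of these scale relations, the following derivation evaluates the variances of Eqs. 2.5.7 and 2.5.9 for the gaussian kernel of Eq. 2.6.7. It assumes the Fourier transform convention $\mathcal{F}[\phi] = \frac{1}{2\pi}\iint e^{-i(\nu t - \tau\omega)}\, \phi\, dt\, d\omega$ and unit amplitude for the kernel; neither the overall constant nor the signs in the exponent affect the variances.

```latex
\begin{align*}
\phi(t,\omega) &= e^{-t^2/4\sigma_t^2}\, e^{-\omega^2/4\sigma_\omega^2}, \\
|\phi(t,\omega)|^2 &= e^{-t^2/2\sigma_t^2}\, e^{-\omega^2/2\sigma_\omega^2}
  \;\Rightarrow\; \text{variances } \sigma_t^2 \text{ and } \sigma_\omega^2, \\
\Phi(\tau,\nu) &= 2\sigma_t\sigma_\omega\, e^{-\sigma_\omega^2\tau^2}\, e^{-\sigma_t^2\nu^2}, \\
|\Phi(\tau,\nu)|^2 &\propto e^{-2\sigma_\omega^2\tau^2}\, e^{-2\sigma_t^2\nu^2}
  \;\Rightarrow\; \Sigma_\tau^2 = \frac{1}{4\sigma_\omega^2}, \quad
  \Sigma_\nu^2 = \frac{1}{4\sigma_t^2}, \\
\sigma_t \Sigma_\nu &= \sigma_\omega \Sigma_\tau = \tfrac{1}{2}.
\end{align*}
```

Both relations of Theorem E therefore hold with equality, and fixing $\sigma_t$ and $\sigma_\omega$ fixes $\Sigma_\nu$ and $\Sigma_\tau$.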

When $\sigma_T \sigma_\Omega = \frac{1}{2}$, this transform is equivalent to a spectrogram using a gaussian
window. For larger values of $\sigma_T \sigma_\Omega$, this transform is equivalent to convolving such
a spectrogram with a 2-D gaussian.

As a note on its implementation, this last fact was used to compute the figures
below. A more direct method would be to compute the Wigner distribution and
then perform the 2-D convolution specified in Eq. 2.5.2. This is not very efficient in a
digital implementation, however, since the Wigner distribution has to be computed
at high sampling rates to avoid aliasing.‡

By performing a convolution on a spectrogram, far fewer time and frequency samples
need to be computed, since the spectrogram is already a smoothed version of the
Wigner distribution. Further, since the gaussian kernel is uncorrelated in time
and frequency, the 2-D convolution is separable, and can be performed as separate
1-D convolutions in the time and frequency directions, resulting in a relatively
inexpensive computation.

† We have chosen this name for obvious reasons. This risks, however, confusion with the Gauss
(Weierstrass) transformation [see Hille]. In fact, the Gaussian transform of the signal $x(t)$ is the
two-dimensional Gauss-Weierstrass transformation of the Wigner distribution $W_x(t,\omega)$ [see De Bruijn
1967].

‡ In general, the Wigner distribution must be sampled in time at twice the Nyquist rate of the signal
[Claasen & Mecklenbräuker 1980b].
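The separable smoothing step can be sketched as follows. This is a minimal illustration rather than the implementation used for the figures: the spectrogram is assumed to be stored as a list of rows indexed `spec[frequency][time]`, the kernel widths are given in samples, edges are zero-padded, and all function names are hypothetical.

```python
import math

def gaussian_taps(sigma, radius):
    """Normalized samples of a 1-D gaussian with standard deviation sigma (in samples)."""
    taps = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]

def convolve_rows(image, taps):
    """Convolve every row of image with taps, treating samples outside the row as zero."""
    r = len(taps) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([sum(taps[j + r] * row[i - j]
                        for j in range(-r, r + 1) if 0 <= i - j < n)
                    for i in range(n)])
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def smooth_separable(spec, sigma_time, sigma_freq, radius=3):
    """Smooth spec[frequency][time] along time, then along frequency."""
    along_time = convolve_rows(spec, gaussian_taps(sigma_time, radius))
    along_freq = convolve_rows(transpose(along_time), gaussian_taps(sigma_freq, radius))
    return transpose(along_freq)

# Usage: the response to a single impulse is the outer product of the two 1-D kernels.
spec = [[0.0] * 9 for _ in range(9)]
spec[4][4] = 1.0
smoothed = smooth_separable(spec, sigma_time=1.5, sigma_freq=0.8)
```

Two 1-D passes cost on the order of the sum of the two kernel lengths per sample, instead of their product for a direct 2-D convolution.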

2.8. Directional time-frequency transforms

So far, we have assumed that the time-frequency energy representation was
non-directional in the sense that the covariances $\sigma_{t\omega}$ and $\Sigma_{\tau\nu}$ of the transform kernel
were both zero. We shall now examine the consequences of lifting this condition.
We begin with an example. Consider the two transforms specified by the kernels 

and 

These transforms have identical $\sigma_t$ and $\sigma_\omega$, but differ in the sign of $\sigma_{t\omega}$. Figure
2.10 shows their concentration ellipses, and Figure 2.11 gives the transform of the
chirp for these two cases. Notice that the second transform broadens the chirp
much more than the first, which should be evident from the concentration ellipses.
The opposite would be true for the chirp $e^{-it^2}$. These transforms are directionally
sensitive, and using $\sigma_t$ and $\sigma_\omega$ as the sole measures of time-frequency resolution is
obviously inadequate in such cases.

Why consider transforms with such behavior? One answer is to provide a general
treatment of time-frequency locality. Another answer is that it is evidently
possible to obtain better time-frequency resolution for some signals if the transform is
directionally 'tuned' to them than otherwise. This would mean that, in general,
we would need a family of transforms each tuned to a preferred time-frequency
orientation.

Figure 2.10. Concentration ellipses for transform kernels with complementary
orientation selectivity. (a) Concentration ellipse for the first kernel. (b)
Concentration ellipse for the second kernel.

Figure 2.11. Directional transforms for a linear chirp. (a) Transform has
kernel in Figure 2.10a. (b) Transform has kernel in Figure 2.10b. The second
transform broadens this chirp much more than the first, which should be evident
from their concentration ellipses.

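The theory of directional transforms is greatly simplified by a rotation of
coordinates. Let

$$R_\theta \begin{pmatrix} t \\ \omega \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} t \\ \omega \end{pmatrix} \qquad (2.8.1)$$

be the operator that rotates a point $\theta$ radians in the time-frequency plane. Given a
time-frequency representation $F_x(t,\omega)$ of a signal $x(t)$, we can consider the rotated
representation formed by the composition $F_x R_\theta(t,\omega)$. Is this the time-frequency
representation of an actual signal? The answer is yes; if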

w 

. , 1 H^Eti 4 f , . ■nfiimnff A udi \ 

rj(t) =- c ■/ I X(u?)e^ = [2.B-.2) 

2-jtv cos S J 

—oo 

then $W_{x_\theta} = W_x R_\theta$ [see Van Trees 1971]. So if $F_x$ has the kernel $\phi(t,\omega)$ and if $G_x$
has the kernel $\phi R_\theta(t,\omega)$, then $G_{x_\theta} = F_x R_\theta$. In other words, Eq. 2.8.2 rotates the
signal by $\theta$ radians in time-frequency, thus the transform with the rotated kernel
applied to this signal will give the desired effect.

Relative to these new coordinates we can generalize some of the measures of the
previous sections. For example, consider

$$\pi_\theta(t) = \int_{-\infty}^{\infty} F_x R_\theta(t,\omega)\, d\omega. \qquad (2.8.3)$$

This is the marginal of the rotated transform along $\omega$. It follows that the time and
frequency marginals (Eq. 2.4.2) of $F_x(t,\omega)$ are simply $\pi_t = \pi_{\theta=0}$ and $\pi_\omega = 2\pi\, \pi_{\theta=\pi/2}$.
If $\pi_\theta(t) = |x_\theta(t)|^2$, then we will say that the transform preserves the marginal
relative to the direction $\theta$ in time-frequency. Interestingly, the Wigner distribution
uniquely meets this requirement for all $\theta$. The proof is a simple generalization of
Cohen's result. He showed that a shift-invariant quadratic transform preserves the
time marginal, i.e., $\pi_0(t) = |x(t)|^2$, iff $\Phi(\tau,0) = 1$ for all $\tau$. Using $\mathcal{F}[\phi R_\theta] = \Phi R_\theta$,
which is easily verified, it follows that $\pi_\theta(t) = |x_\theta(t)|^2$ iff $\Phi R_\theta(\tau,0) = 1$ for all $\tau$. This
implies that $\Phi(\tau,\nu) \equiv 1$, which corresponds to the Wigner distribution by Eq. 2.5.3.
This is the reason for considering the Wigner distribution 'perfectly localized' and
$\phi(t,\omega)$ the 'point spread function' in time-frequency.


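The amount of spread in time-frequency direction $\theta$ can be measured by the variance

$$\sigma_\theta^2 = \frac{\iint t^2\, |\phi R_\theta(t,\omega)|^2\, dt\, d\omega}{\iint |\phi R_\theta(t,\omega)|^2\, dt\, d\omega}. \qquad (2.8.4)$$

In the notation of the previous sections, $\sigma_t = \sigma_0$, $\sigma_\omega = \sigma_{\pi/2}$, and

$$\sigma_\theta^2 = \begin{pmatrix} \cos\theta & \sin\theta \end{pmatrix} \begin{pmatrix} \sigma_t^2 & \sigma_{t\omega} \\ \sigma_{t\omega} & \sigma_\omega^2 \end{pmatrix} \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}. \qquad (2.8.5)$$

Let $\sigma_1$ be the maximum value and $\sigma_2$ be the minimum value of $\sigma_\theta$; their squares are
the eigenvalues of the covariance matrix in Eq. 2.8.5. Further, let $\theta^*$ be the maximum
direction, which corresponds to the eigenvector of the larger eigenvalue.
In other words, $\sigma_1$ and $\sigma_2$ are the maximum and minimum dimensions of the
concentration ellipse of $\phi(t,\omega)$, and $\theta^*$ is the angle of the major axis of the concentration
ellipse relative to the time axis. These three quantities conveniently specify the
time-frequency locality of the transform.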
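These three quantities follow from the entries of the 2-by-2 covariance matrix in closed form. The sketch below is an illustrative helper, not part of the original text; the function name `concentration_axes` and the example values are assumptions. It uses the standard eigen-decomposition of a symmetric 2-by-2 matrix.

```python
import math

def concentration_axes(var_t, var_w, cov_tw):
    """Return (sigma_1, sigma_2, theta_star) for the covariance matrix
    [[var_t, cov_tw], [cov_tw, var_w]].

    sigma_1 and sigma_2 are the square roots of the larger and smaller
    eigenvalues; theta_star is the angle of the major axis from the time axis.
    """
    mean = 0.5 * (var_t + var_w)
    half_diff = 0.5 * (var_t - var_w)
    root = math.hypot(half_diff, cov_tw)
    lam_max, lam_min = mean + root, mean - root
    theta_star = 0.5 * math.atan2(2.0 * cov_tw, var_t - var_w)
    return math.sqrt(lam_max), math.sqrt(max(lam_min, 0.0)), theta_star

# A non-directional kernel: the axes coincide with the time and frequency axes.
s1_a, s2_a, th_a = concentration_axes(4.0, 1.0, 0.0)
# A directional kernel with equal variances and positive covariance: major axis at 45 degrees.
s1_b, s2_b, th_b = concentration_axes(1.0 / 3.0, 1.0 / 3.0, 1.0 / 6.0)
```

Substituting `theta_star` into Eq. 2.8.5 reproduces the larger eigenvalue, which is a quick consistency check on the angle convention.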


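In an analogous manner, we can measure the smoothness of the transform in
time-frequency direction $\theta$ by

$$\Sigma_\theta^2 = \frac{\iint \tau^2\, |\Phi R_\theta(\tau,\nu)|^2\, d\tau\, d\nu}{\iint |\Phi R_\theta(\tau,\nu)|^2\, d\tau\, d\nu}. \qquad (2.8.6)$$

In the notation of the previous sections, $\Sigma_\tau = \Sigma_0$, $\Sigma_\nu = \Sigma_{\pi/2}$, and

$$\Sigma_\theta^2 = \begin{pmatrix} \cos\theta & \sin\theta \end{pmatrix} \begin{pmatrix} \Sigma_\tau^2 & \Sigma_{\tau\nu} \\ \Sigma_{\tau\nu} & \Sigma_\nu^2 \end{pmatrix} \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}. \qquad (2.8.7)$$

Let $\Sigma_1$ be the maximum value and $\Sigma_2$ be the minimum value of $\Sigma_\theta$, and let $\theta^{**}$ be the
maximum direction. These three quantities conveniently specify the time-frequency
smoothness of the transform.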

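We are now in a position to generalize Theorem E.

Theorem F. If $F_x(t,\omega)$ is shift-invariant, then $\sigma_1 \Sigma_2 \ge \frac{1}{2}$ and $\sigma_2 \Sigma_1 \ge \frac{1}{2}$, with
equality in both these relations iff

$$\phi R_{\theta^*}(t,\omega) \propto e^{-t^2/4\sigma_1^2}\, e^{-\omega^2/4\sigma_2^2}. \qquad (2.8.8)$$

Proof: Applying Theorem E to the transform with kernel $\phi R_{\theta^{**}}$, we have
$\frac{1}{2} \le \sigma_{\theta^{**}} \Sigma_2 \le \sigma_1 \Sigma_2$. Similarly, with the kernel $\phi R_{\theta^*}$, $\frac{1}{2} \le \sigma_2 \Sigma_{\theta^*} \le \sigma_2 \Sigma_1$. The right hand
inequalities are satisfied with equality iff $\theta^* = \theta^{**}$. It follows from Theorem E that
Eq. 2.8.8 is a necessary and sufficient condition that all these inequalities are satisfied
with equality. ///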

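Generalizing Theorem D requires that we use the directional variance of $\phi(t,\omega)$, not
$|\phi(t,\omega)|^2$:

$$\sigma_\Theta^2 = \frac{\iint t^2\, \phi R_\Theta(t,\omega)\, dt\, d\omega}{\iint \phi R_\Theta(t,\omega)\, dt\, d\omega}. \qquad (2.8.9)$$

By analogy with $\sigma_1$, $\sigma_2$, and $\theta^*$, let $\sigma_I$ and $\sigma_{II}$ be the maximum and minimum values
of $\sigma_\Theta$, and let $\Theta^*$ be the maximum direction.

Theorem G. Let $F_x(t,\omega)$ be positive and shift-invariant. Then $\sigma_I \sigma_{II} \ge \frac{1}{2}$.

Proof: Apply Theorem D to the signal $x_{-\Theta^*}(t)$ and the transform with kernel $\phi R_{\Theta^*}$.
///

Corollary. If $F_x(t,\omega)$ is positive and shift-invariant, then

$$\begin{vmatrix} \sigma_T^2 & \sigma_{T\Omega} \\ \sigma_{T\Omega} & \sigma_\Omega^2 \end{vmatrix}^{1/2} \ge \frac{1}{2}.$$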


From Theorem F, we see that a two-dimensional gaussian transform kernel gives the
best time-frequency locality for a given smoothness. In this general case, however,
the gaussian kernel may be correlated in time and frequency, i.e., its concentration
ellipse may be oriented obliquely in the time-frequency plane. By specifying $\sigma_1$
($= 1/2\Sigma_2$), $\sigma_2$ ($= 1/2\Sigma_1$), and $\theta^*$ for this kernel we are, in effect, selecting a particular
time-frequency scale for the transform. By Theorem G, we may choose any values
we wish provided $\sigma_I \sigma_{II} \ge \frac{1}{2}$, and the resulting transform will best satisfy all our
design properties.

When $\sigma_I \sigma_{II} = \frac{1}{2}$, this transform is equivalent to a spectrogram with a rotated
gaussian window $g_\theta(t)$ [cf. Riley 1983, Dudgeon 1984]. For larger values of $\sigma_I \sigma_{II}$,
this transform is equivalent to convolving such a spectrogram with a 2-D gaussian.

2.9. A speech example

In this section we examine a particular utterance, comparing the various signal
representations discussed above. The utterance is /wioi/ taken from "We owe Eve
a dollar," as produced by an adult male. This utterance has some rapid F2 motion,
which makes it useful as an example of non-stationary behavior in speech.

Figure 2.12. Log magnitude spectrograms of the utterance /wioi/. (a) Wideband
(gaussian window standard deviation of 1 msec). (b) Narrowband (standard
deviation of 15 msec). (c) Intermediate band (standard deviation of 4 msec).


Figure 2.12a,b show the traditional wideband and narrowband spectrograms for this
utterance. These are spectrograms computed with gaussian windows of standard
deviation 1 msec and 15 msec, respectively. The wideband spectrogram shows
vertical striations spaced at the pitch period. The narrowband spectrogram shows
horizontal striations spaced at the fundamental frequency. They are both due to the
voiced excitation. Figure 2.12c shows a spectrogram whose window duration is 4
msec, which is intermediate between the previous two. This window size is matched
to the excitation in the following sense. The 2-D gaussian kernel (Eq. 2.6.7) that
corresponds to this spectrogram has standard deviations of 2 msec by 20 Hz. These
are in the same ratio as 10 msec and 100 Hz, the pitch period and the fundamental
frequency, respectively. This choice gives rise to rows and columns of sharp peaks
and valleys spaced at the pitch period and the fundamental frequency. We will see
in the next chapter why the excitation produces this particular structure.
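The quoted kernel scales can be reproduced with a short calculation. The sketch below assumes that the window is $g(t) = e^{-t^2/2s^2}$ with $s$ the stated standard deviation, so that its Wigner distribution is proportional to $e^{-t^2/s^2 - s^2\omega^2}$; matching this to the form of Eq. 2.6.7 gives $\sigma_t = s/2$ and $\sigma_\omega = 1/2s$. The function name is hypothetical.

```python
import math

def kernel_scales_from_window(window_std_sec):
    """Return (sigma_t in seconds, sigma_w in Hz) of the spectrogram kernel,
    assuming a gaussian window exp(-t^2 / (2 s^2)) with s = window_std_sec."""
    sigma_t = window_std_sec / 2.0
    sigma_w_rad = 1.0 / (2.0 * window_std_sec)   # radians per second
    return sigma_t, sigma_w_rad / (2.0 * math.pi)

sigma_t, sigma_w_hz = kernel_scales_from_window(4e-3)
kernel_ratio = sigma_t / sigma_w_hz        # seconds per Hz for the kernel
excitation_ratio = 10e-3 / 100.0           # pitch period over fundamental frequency
```

With a 4 msec window this gives about 2 msec by 19.9 Hz, and the two ratios agree to within about one percent.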

Figure 2.13 shows the Wigner distribution for this utterance. Compared to Figure 
2,13 it look e almost as if the vertical scale has changed, but it has flot. This repre¬ 
sentation is dominated by cross-teema that give ‘echoes' of the formants in initially 
suprisiilg places. Hut remember that the sum of two comply exponetials at diJfer- 
ent frequencies gave rise to a cross-term half-way between them that had greater 
amplitude than the original terms (Figure 2.5). Evidently, the Wigner distribution 
itself gives a confusing picture of multi-component signals such as Speech, 

Figure 2.14 shows the time-frequency autocorrelation function, the 2-D Fourier
transform of the Wigner distribution, for this utterance in the neighborhood of
the origin. Notice the repeated pattern in rows and columns spaced at the pitch
period and the fundamental frequency. In Chapter 3 we will see that this pattern
can be exploited in understanding how to suppress the excitation.

Figure 2.13. Log magnitude of Wigner distribution. (This is implemented as a
pseudo-Wigner distribution using a gaussian window of standard deviation 40 msec;
see Claasen & Mecklenbräuker 1980b.)

Figure 2.14. Log magnitude of time-frequency autocorrelation function in the
vicinity of the origin.

Figure 2.15. Gaussian transform with kernel scales chosen to suppress excitation,
$\sigma_t$ = 10 msec and $\sigma_\omega$ = 100 Hz. (a) 2-D plot, (b) 3-D plot.




























































Figure 2.15 shows the Gaussian transform of this signal using a kernel of a scale
chosen to suppress the excitation. The pitch striations are removed, leaving smooth
time-frequency ridges that correspond to the formants. The ridges are quite sharp,
although it is somewhat difficult to appreciate this in the half-toned picture, Figure
2.15a. The 3-D plot in Figure 2.15b gives a different perspective on this surface;
it shows F1 and parts of F2 quite nicely, although most everything above 2 kHz is
considerably distorted in this presentation.

Finally, Figure 2.16 shows directional transforms of this utterance using oriented
Gaussian kernels matched to different aspects of the signal. In Figure 2.16a, the
kernel orientation is matched to the rising F2. In Figure 2.16b, the kernel orientation
is matched to the falling F2. These choices bring out the selected formant peak with
high resolution.

In this chapter, we have found that a particular time-frequency energy
representation, the Gaussian transform, best satisfies a set of properties deemed desirable.
There are several free parameters for this representation ($\sigma_t$, $\sigma_\omega$, and $\theta$), which
determine the scale and directional selectivity of the transform. Deciding what scales
are of interest requires a more specific model of the signal. In the next chapter, we
adopt such a model.






Figure 2.16. Directional transforms using oriented Gaussian kernel matched to
different aspects of the signal. (a) Kernel orientation matched to rising F2. (b)
Kernel orientation matched to falling F2.


















Chapter 3. 

Time-frequency filtering 


In this chapter, we continue the discussion of joint time-frequency energy
representations for speech signals. Here we shall make stronger assumptions about the form
of the signals. We will introduce a particular model of the time-varying vocal tract,
and define its 'transfer function', $H(t,\omega)$. We will show that time-frequency filtering
can be used to estimate $|H(t,\omega)|^2$, a technique that is essentially a two-dimensional
generalization of straightforward stationary methods. Further, we will see that
$|H(t,\omega)|^2$ is closely related to the time-frequency representations of the previous
chapter.

3.1. The stationary case 

First, let us re-examine the stationary case. If we adopt a more detailed model
of the generation of a stationary speech signal, we can say much more about the
cepstral methods discussed in the previous chapter. The linear model [Fant 1960;
Flanagan 1972] of vowel production begins by decomposing the speech signal into
a vocal source component (e.g., periodic vocal fold vibration) and a vocal tract
component, which are treated as independent. The vocal tract is modelled as a
linear and quasi-time-invariant filter with excess pressure and volume velocity (of
assumed one-dimensional wave motion) being analogous to voltage and current in
circuit theory. The distribution of the poles of the filter's system function constitutes
the formant description of the vocal tract.


In other words, $H(i\omega)$, the transfer function of the stationary vocal tract, can be
approximated by [Flanagan 1972]†

$$H(i\omega) = \sum_{n=1}^{N} \left[ a_n H_n(i\omega) + a_n^* H_{-n}(i\omega) \right], \qquad (3.1.1)$$

where $H_n(s)$ consists of a simple pole at $s_n = \sigma_n + i\omega_n$,

$$H_n(i\omega) = \frac{1}{i\omega - (\sigma_n + i\omega_n)}, \qquad (3.1.2)$$

and $a_n$ is the residue at the $n$th pole,


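$$a_n = \frac{1}{2i\omega_n} \prod_{m \neq n} \left[ (\sigma_n - \sigma_m)^2 + (\omega_m^2 - \omega_n^2) + 2i\omega_n(\sigma_n - \sigma_m) \right]^{-1}. \qquad (3.1.3)$$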

We associate a formant with each pole, or more precisely, with each pair of poles,
since they occur in conjugate pairs, i.e., $s_{-n} = s_n^*$, given the impulse response of
the vocal tract is real. The impulse response of the stationary vocal tract, in fact,
is

$$h(t) = \sum_{n=1}^{N} \left[ a_n h_n(t) + a_n^* h_{-n}(t) \right], \qquad (3.1.4)$$

where

$$h_n(t) = e^{(\sigma_n + i\omega_n)t}\, u(t). \qquad (3.1.5)$$
In this linear time-invariant model, it follows that the spectrum of the excitation
and the vocal tract transfer function combine by multiplication in the power
spectrum and addition in the log spectrum. This fact leads to a simple procedure for
separating the excitation and the vocal tract transfer function in certain (idealized)
cases.

† This is the parallel formulation. The serial formulation, $H(i\omega) = \prod_{n=1}^{N} \left[ (i\omega - s_n)(i\omega - s_n^*) \right]^{-1}$, is also
used. The former is the partial fraction expansion of the latter.

Suppose the excitation is an impulse train, which is a very simple model of
constant pitch, voiced excitation. In this case, the spectrum of the excitation is also
an impulse train, and thus, the speech spectrum is a uniformly sampled version of
the vocal tract transfer function. If the sampling were unaliased (i.e., the pitch is
low enough relative to the highest transfer function quefrencies) the original transfer
function can be exactly recovered by ideal low-pass filtering the spectrum, by the
sampling theorem [Bracewell 1978]. But this is just cepstral smoothing using, in this
very idealized case, a rectangular cepstral window [Oppenheim 1969; Oppenheim
& Schafer 1975].

Let us examine this result more closely. The formulation here will be in terms of the
power spectrum and its transform, the autocorrelation function, instead of the more
usual log spectrum and its transform, the cepstrum, since the former generalizes
more easily to the time-varying case. Since the term 'cepstral filtering' is, strictly
speaking, reserved for filtering operations on the log magnitude spectrum, we shall
refer to analogous operations on the power spectrum as autocorrelation filtering.
The results in the stationary case are similar in either formulation.†

† Cepstral and autocorrelation filtering can both be used to separate signal components that arise
at different scales in the frequency domain. Cepstral filtering is most appropriate when the signal
components combine by convolution in the time domain, autocorrelation filtering when they combine
by addition. Both approaches can be used for speech, since we can use either a serial or parallel
formulation of the vocal tract model.

If $x(t)$ represents the excitation, $h(t)$ the impulse response of the vocal tract, and
$y(t)$ the output speech signal, then in terms of power spectra and transfer function,
$|Y(i\omega)|^2 = |X(i\omega)|^2 \cdot |H(i\omega)|^2$, or in terms of autocorrelation functions,

$$A_y(\tau) = \int_{-\infty}^{\infty} A_x(t)\, A_h(\tau - t)\, dt. \qquad (3.1.6)$$

Let the excitation be an impulse train, $I(t;T) = \sum_k \delta(t - kT)$. Then

$$A_I(\tau) = \frac{1}{T} \sum_{k=-\infty}^{\infty} \delta(\tau - kT). \qquad (3.1.7)$$

Thus from Eq. 3.1.6, we have

$$A_y(\tau) = \frac{1}{T} \sum_{k=-\infty}^{\infty} A_h(\tau - kT). \qquad (3.1.8)$$

Provided the duration of $A_h(\tau)$ is small enough that the terms in Eq. 3.1.8 do
not overlap, $A_h(\tau)$, and thus $|H(i\omega)|^2$, can be recovered by windowing $A_y(\tau)$ with a
rectangular window centered on the origin and of duration $T$ (see Figure 3.1).


Let us examine the form of $A_h(\tau)$. Assume for now that the vocal tract transfer
function consists of only a single pole, i.e., its impulse response has the form of
Eq. 3.1.5. Then

$$
\begin{aligned}
A_h(\tau) &= \int_{-\infty}^{\infty} h_n(\tau + t)\, h_n^*(t)\, dt \\
&= e^{(\sigma_n + i\omega_n)\tau} \int_{-\infty}^{\infty} e^{2\sigma_n t}\, u(\tau + t)\, u(t)\, dt \\
&= \frac{1}{\beta_n}\, e^{i\omega_n \tau}\, e^{-\beta_n |\tau| / 2},
\end{aligned}
\qquad (3.1.9)
$$

where $\beta_n = -2\sigma_n$ is the (half-power) bandwidth of the pole. Thus, provided this
bandwidth is large enough, the overlap in the terms in Eq. 3.1.8 will be negligible,
and windowing $A_y(\tau)$ will very nearly recover $A_h(\tau)$ and hence $|H(i\omega)|^2$.†

† The phase of the transfer function can be found, if desired, from its magnitude, since this model is
minimum phase [Oppenheim & Schafer 1975].
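The recovery procedure described above can be checked numerically. The following is a minimal sketch, not the author's implementation: it assumes the closed form of Eq. 3.1.9 for a single pole, a hypothetical pole frequency of 1 kHz, and the 300 Hz bandwidth and 10 msec period quoted for Figure 3.1. The function names and the trapezoid-rule integration are illustrative choices.

```python
import numpy as np

def pole_autocorr(tau, beta, w0):
    """Closed-form autocorrelation of a single pole (form of Eq. 3.1.9)."""
    return np.exp(1j * w0 * tau) * np.exp(-beta * np.abs(tau) / 2) / beta

def autocorr_filter_estimate(omega, beta, w0, T, n_terms=20, n_tau=4001):
    """Estimate |H(i omega)|^2 by rectangular windowing of A_y(tau)."""
    # Lags inside the rectangular window of duration T centered on the origin.
    tau = np.linspace(-T / 2, T / 2, n_tau)
    # Output autocorrelation: copies of A_h at every pitch period (Eq. 3.1.8).
    k = np.arange(-n_terms, n_terms + 1)
    A_y = pole_autocorr(tau[:, None] - k[None, :] * T, beta, w0).sum(axis=1) / T
    windowed = T * A_y  # the factor T undoes the 1/T of the impulse train
    # Fourier transform of the windowed autocorrelation (trapezoid rule).
    integrand = windowed[None, :] * np.exp(-1j * np.outer(omega, tau))
    dtau = tau[1] - tau[0]
    estimate = (integrand[:, 1:] + integrand[:, :-1]).sum(axis=1) * dtau / 2
    return estimate.real

beta = 2 * np.pi * 300.0   # 300 Hz bandwidth, as in Figure 3.1
w0 = 2 * np.pi * 1000.0    # hypothetical 1 kHz pole frequency (assumption)
T = 0.010                  # 10 msec pitch period
omega = 2 * np.pi * np.linspace(500.0, 1500.0, 101)

estimate = autocorr_filter_estimate(omega, beta, w0, T)
exact = 1.0 / ((beta / 2) ** 2 + (omega - w0) ** 2)   # |H_n(i omega)|^2
rel_err = np.max(np.abs(estimate - exact)) / np.max(exact)
print(f"worst-case error relative to the peak: {rel_err:.3f}")
```

With these values the neighbouring terms of Eq. 3.1.8 overlap only slightly inside the window, so the estimate should stay within a few percent of the exact pole shape; narrowing the bandwidth increases the overlap and the error.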
Figure 3.1. Recovering the transfer function by autocorrelation filtering. (a)
Spectrum of the excitation modelled as an impulse train (10 msec period). (b)
Square magnitude of the transfer function, which in this simple example is a single
pole of 300 Hz bandwidth. (c) Power spectrum, the product of '(a)' and '(b)'.
Cepstral filtering uses the log spectrum instead; the approach here generalizes
more easily to the time-varying case. (continued...)

Figure 3.1 (continued). Recovering the transfer function by autocorrelation
filtering. (d) Magnitude of the autocorrelation function, the (inverse) Fourier
transform of '(c)'. Dashed lines show the rectangular window. (e) Fourier transform
of the windowed autocorrelation function, which very nearly recovers the transfer
function '(b)' in this idealized case (the effect of the slight overlap of the terms in
'(d)' is negligible).


The analysis of the multiple pole case follows from superposition. Provided the
poles are not closely spaced relative to their bandwidths,‡

$$|H(i\omega)|^2 \approx \sum_{n=1}^{N} \left[ |a_n H_n(i\omega)|^2 + |a_n^* H_{-n}(i\omega)|^2 \right] \qquad (3.1.11)$$

from Eq. 3.1.1 and Eq. 3.1.2, hence

$$A_h(\tau) \approx \sum_{n=1}^{N} \frac{|a_n|^2}{\beta_n}\, e^{-\beta_n |\tau| / 2} \left( e^{i\omega_n \tau} + e^{-i\omega_n \tau} \right) \qquad (3.1.12)$$

from Eq. 3.1.9. From this equation and Eq. 3.1.8, we see that windowing the
autocorrelation function of the output speech signal can still be used to recover the
transfer function when the bandwidths are large enough that aliasing is negligible.

‡ The analysis in terms of log spectra and cepstra does not require this proviso, since convolutions in
the time domain transform (exactly) to sums in the cepstral domain. This is an advantage of the
cepstral approach.

A few changes to this model make it more realistic. First, the spectrum of constant
voiced excitation is somewhat better modelled as an impulse train that drops off
at 12 dB per octave [Flanagan 1972]. This trend can be removed by spectral
pre-emphasis.

Second, the sampling is usually significantly aliased, which is a more serious
problem. In this case, we can recover only a low-pass version of the transfer function. A
rectangular window is a poor choice in this case, since its transform rings for a
considerable duration in the frequency domain. The gaussian is a good choice, because
it has minimal bandwidth for a given window duration, as indicated in the previous
chapter (see Figure 3.2). Typically, the standard deviation of the gaussian window
is selected about equal to the pitch period.
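The difference in ringing between the two window shapes can be made concrete with a short numerical comparison. This is an illustrative sketch only: the sampling step, the truncation of the gaussian at five standard deviations, and the helper name `peak_sidelobe_ratio` are assumptions, not details from the text.

```python
import numpy as np

def peak_sidelobe_ratio(window, n_fft=1 << 16):
    """Largest sidelobe of the window's spectrum relative to its main-lobe peak."""
    mag = np.abs(np.fft.rfft(window, n_fft))
    # The main lobe ends where the magnitude first starts to rise again.
    rising = np.flatnonzero(np.diff(mag) > 0)
    if rising.size == 0:
        return 0.0  # monotonically decreasing spectrum: no sidelobes at all
    return float(mag[rising[0]:].max() / mag[0])

dt = 1e-4          # 0.1 msec sampling step (assumption)
T = 0.010          # 10 msec pitch period

rect = np.ones(int(round(T / dt)))            # rectangular window of duration T
sigma = T                                     # gaussian std about one pitch period
t = np.arange(-5 * sigma, 5 * sigma + dt / 2, dt)
gauss = np.exp(-0.5 * (t / sigma) ** 2)

rect_ratio = peak_sidelobe_ratio(rect)
gauss_ratio = peak_sidelobe_ratio(gauss)
print(f"rectangular: {rect_ratio:.3f}   gaussian: {gauss_ratio:.2e}")
```

The rectangular window's transform has sidelobes at roughly a fifth of the peak, which is the ringing referred to above; the gaussian's transform is itself gaussian, so any residual sidelobes come only from truncating the window.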

3.2. Non-stationary vocal tract

Let us now consider the case where the vocal tract configuration is not necessarily
static. The goal is to recover the 'time-varying transfer function' of the vocal tract
from the signal and remove the excitation, as we did in the stationary case.

Unfortunately, there is no widely accepted, satisfactory definition of the transfer
function for a time-varying linear filter, although there have been many proposals
[e.g., see Lui 1971; Loynes 1968; Page 1952; Saleh & Subotic 1985; Zadeh 1950].
We shall avoid this difficulty by constraining the form of the transfer function; we
shall allow non-stationarity, but only in certain well-behaved ways.

Figure 3.2. Estimating 'aliased' transfer function. (a) Spectrum of excitation
modelled as an impulse train (10 msec period). (b) Square magnitude of the transfer
function, a single pole of 150 Hz bandwidth. This has higher 'quefrencies' than
the previous example; '(a)' undersamples it in this case. (c) Power spectrum, the
product of '(a)' and '(b)'. (continued...)

Figure 3.2 (continued). Estimating 'aliased' transfer function. (d) Magnitude
of the autocorrelation function, the (inverse) Fourier transform of '(c)'. Dotted lines
show the gaussian window. (e) Fourier transform of the windowed autocorrelation
function, which recovers a low-pass version of the transfer function '(b)'.

The vocal tract, of course, is not an arbitrary time-varying filter; it is constrained
by the physical properties of the articulators. Jospa [1982, 1984] has investigated the
physics of the non-stationary vocal tract analytically, and found that under certain
reasonable physical assumptions it is possible to generalize the notion of a formant
to the time-varying case. Essentially, he replaces the assumption of a static vocal
tract configuration by the assumption that the deformations are slow enough to
satisfy the conditions of adiabatic approximation, which he indicates appear to be
generally valid from cine X-ray measurements.

We can thus define the impulse response, $h(t,a)$, for a time-varying 'resonance' of
the vocal tract to an impulse, $\delta(t - a)$, at time $a$ as:

$$h(t,a) = e^{-\frac{\beta_0}{2}(t - a) + i\int_a^t [\omega_0 + \gamma(\tau)]\, d\tau}\; u(t - a), \qquad (3.2.1)$$

where we assume the formant bandwidth $\beta_0$ is fixed, and the formant center
frequency is $\omega_0$ at $t = 0$. Note that Eq. 3.2.1 reduces to the usual definition of the
impulse response of a formant if the time-varying modulation frequency, $\gamma(t)$, is
zero.

In Jospa's model, the bandwidth varies somewhat with rate of change of vocal tract
area, which we shall treat as negligible. Regarding these bandwidth variations, Fant
[1980] believes they "...are of academic rather practical significance. Of greater
importance is probably the mere fact that a rapid transition of a formant creates a
special perceptual 'chirp' effect."

It will be convenient to examine a more general class of impulse responses than in
Eq. 3.2.1. Consider the impulse response

$$h(t,a) = h_0(t - a)\, e^{i\int_a^t \gamma(\tau)\, d\tau}, \qquad (3.2.2)$$

where $h_0(t)$ is the impulse response of a linear time-invariant (LTI) system and
$\gamma(0) = 0$. Eq. 3.2.1 has this form with $h_0(t) = e^{(-\beta_0/2 + i\omega_0)t}\, u(t)$. We call this a
frequency-modulated (FM) filter. We shall study this kind of filter in the next several
sections, since it is possible to generalize the notion of a transfer function for it
and it is possible to estimate this transfer function by generalizing the 'cepstral'
methods described above. Of course, an FM filter models only a single pole; we
shall take up the multiple pole model of the complete vocal tract transfer function
in a later section.

How then can we represent the time-varying transfer function of an FM filter? An
intuitively appealing candidate is

$$H(t,\omega) = H_0[i(\omega - \gamma(t))], \qquad (3.2.3)$$

where $H_0(i\omega)$ is the transfer function of the corresponding stationary filter with
impulse response $h_0(t)$ (Eq. 3.2.2). In terms of how we might want to visualize
the transfer function of an FM filter, this seems attractive; it is just the stationary
transfer function shifted at each time by the local modulation frequency $\gamma(t)$. For
a time-varying formant pole, $H(t,\omega)$ would have the form of a stationary pole in
each frequency cross-section with center frequency $\omega_0 + \gamma(t)$ and fixed bandwidth
$\beta_0$.
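As a small illustration of Eq. 3.2.3, the sketch below evaluates the squared magnitude of a single FM pole and confirms that its peak follows the modulation. The specific values (a 1500 Hz formant, 100 Hz bandwidth, a linear modulation of 10 Hz/msec) and the function name `fm_pole_power` are assumptions chosen for the example, not values taken from the text.

```python
import numpy as np

def fm_pole_power(t, omega, beta0, omega0, gamma):
    """|H(t, omega)|^2 for one FM pole, with H(t, omega) = H0(i(omega - gamma(t)))."""
    shifted = omega - gamma(t)                    # remove the local modulation
    # Stationary single pole at s0 = -beta0/2 + i*omega0.
    H0 = 1.0 / (1j * shifted - (-beta0 / 2 + 1j * omega0))
    return np.abs(H0) ** 2

beta0 = 2 * np.pi * 100.0        # 100 Hz bandwidth (assumption)
omega0 = 2 * np.pi * 1500.0      # 1500 Hz center frequency at t = 0 (assumption)
slope = 2 * np.pi * 1.0e4        # 10 Hz/msec expressed in rad/s^2

def gamma(t):
    """Hypothetical linear modulation frequency with gamma(0) = 0."""
    return slope * t

freqs_hz = np.arange(1000.0, 2001.0, 1.0)
peaks_hz = []
for t in (0.0, 0.010, 0.020):
    power = fm_pole_power(t, 2 * np.pi * freqs_hz, beta0, omega0, gamma)
    peaks_hz.append(float(freqs_hz[np.argmax(power)]))
print(peaks_hz)
```

Each time cross-section is the same stationary pole shape, and only its center moves, which is the visualization the definition is meant to support.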

For our purposes, the most important properties that the definition of the
time-varying transfer function of a formant should satisfy are practical ones: it should
provide phonetically relevant information about the signal, and it should be
computable from the signal. The representation in Eq. 3.2.3 satisfies these properties
since it is a simple generalisation of the stationary case, which is already understood,
and it can be estimated from the signal by methods we will describe shortly.

The transfer function of an LTI filter, however, also has some nice theoretical
properties that would be desirable when generalized to the time-varying case. In
particular, the transfer function $H_0(i\omega)$ of an LTI filter, $y(t) = T_0[x(t)]$, (1) specifies the
eigenvalues for the filter's eigenfunctions, i.e.,

$$T_0[e^{i\omega t}] = H_0(i\omega)\, e^{i\omega t}, \qquad (3.2.4)$$

and (2) is the ratio of the spectrum of the output over the spectrum of the input,
i.e.,

$$H_0(i\omega) = \frac{Y(i\omega)}{X(i\omega)}. \qquad (3.2.5)$$

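The first property does generalize to the FM case. Consider the functions

$$\phi_\omega(t) = e^{i\int_0^t [\omega + \gamma(\tau)]\, d\tau}. \qquad (3.2.6)$$

These are the eigenfunctions for an FM filter $T$, with impulse response defined by
Eq. 3.2.2. This follows from

$$
\begin{aligned}
T[\phi_\omega(t)] &= \int_{-\infty}^{\infty} h(t,a)\, \phi_\omega(a)\, da \\
&= \int_{-\infty}^{\infty} h_0(t - a)\, e^{i\int_a^t \gamma(\tau)\, d\tau}\, e^{i\int_0^a [\omega + \gamma(\tau)]\, d\tau}\, da \\
&= e^{i\int_0^t \gamma(\tau)\, d\tau} \int_{-\infty}^{\infty} h_0(t - a)\, e^{i\omega a}\, da \\
&= H_0(i\omega)\, \phi_\omega(t).
\end{aligned}
\qquad (3.2.7)
$$

Further, we see from Eq. 3.2.7 that $H_0(i\omega)$ specifies the eigenvalues for the
eigenfunctions $\phi_\omega(t)$. The value of $H_0(i\omega)$, however, depends on the choice of the time
origin. More generally,

$$T[\phi_\omega(t)] = H(0,\omega)\, \phi_\omega(t) \qquad (3.2.8)$$

is time shift-invariant, where $H(t,\omega)$ is defined by Eq. 3.2.3.†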
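The eigenfunction property can be verified numerically for a particular FM filter. The sketch below is an illustration under stated assumptions: a single-pole $h_0(t)$ with 300 Hz bandwidth, a linear modulation $\gamma(t) = mt$, and trapezoid-rule integration over a finite span. The names `fm_filter_output` and `Gamma` (the running integral of $\gamma$) are hypothetical.

```python
import numpy as np

def fm_filter_output(phi, t0, sigma0, omega0, Gamma, span=0.03, ds=1e-6):
    """y(t0) = integral over a of h(t0, a) * phi(a), for the FM filter of Eq. 3.2.2."""
    s = np.arange(int(round(span / ds)) + 1) * ds      # s = t0 - a, non-negative
    a = t0 - s
    h0 = np.exp((sigma0 + 1j * omega0) * s)            # single-pole h0(t0 - a)
    h = h0 * np.exp(1j * (Gamma(t0) - Gamma(a)))       # FM impulse response h(t0, a)
    f = h * phi(a)
    return (f[1:] + f[:-1]).sum() * ds / 2             # trapezoid rule

m = 2 * np.pi * 1.0e4              # modulation slope in rad/s^2 (assumption)
sigma0 = -np.pi * 300.0            # pole with 300 Hz bandwidth (assumption)
omega0 = 2 * np.pi * 1500.0        # pole frequency (assumption)
omega = 2 * np.pi * 1400.0         # frequency index of the eigenfunction
t0 = 0.030                         # time at which the output is evaluated

def Gamma(t):
    """Integral of gamma(tau) = m * tau from 0 to t."""
    return 0.5 * m * t ** 2

def phi(t):
    """Candidate eigenfunction of Eq. 3.2.6."""
    return np.exp(1j * (omega * t + Gamma(t)))

y = fm_filter_output(phi, t0, sigma0, omega0, Gamma)
H0 = 1.0 / (1j * omega - (sigma0 + 1j * omega0))       # stationary transfer function
ratio = y / (H0 * phi(t0))
print(abs(ratio - 1.0))
```

The ratio should be very close to one, meaning the filter output equals $H_0(i\omega)\,\phi_\omega(t_0)$ up to discretization error.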
By comparison, some authors have used

$$\hat{H}(t,\omega) = \int_{-\infty}^{\infty} h(t,a)\, e^{-i\omega(t - a)}\, da \qquad (3.2.9)$$

as their definition of the time-varying transfer function [e.g., Zadeh 1950]. The
filter's response to a complex exponential $e^{i\omega t}$ is $\hat{H}(t,\omega)\, e^{i\omega t}$. However, $e^{i\omega t}$ is not, in
general, an eigenfunction of a time-varying system; consequently $\hat{H}(t,\omega)$ has limited
use.

Saleh & Subotic [1985] have explored generalising the second property (Eq. 3.2.5)
to the time-varying case. They suggest using

$$\frac{F_y(t,\omega)}{F_x(t,\omega)} \qquad (3.2.10)$$

as the definition of the time-varying transfer function, where $F_x(t,\omega)$ and $F_y(t,\omega)$ are
joint time-frequency representations of the input and output signals, respectively.
The difficulty with their approach is that the ratio in Eq. 3.2.10, in general, will
have different values for different inputs $x(t)$ for a given filter, unlike the LTI case
(Eq. 3.2.5). This second property evidently does not generalise well to the
time-varying case.

† I.e., suppose $t' = t - \tau$. Let $H'(t',\omega)$ and $H_0'(i\omega)$ be the time-varying transfer function and the
corresponding LTI transfer function, respectively, in the new time co-ordinate. Then,
$H'(0,\omega) = H_0'(i\omega) = H_0[i(\omega - \gamma(\tau))] = H(\tau,\omega)$.

3.3. Time-frequency filtering

The remainder of this chapter is used to show that time-frequency filtering can
be used to estimate the transfer function of FM filters and, more generally, of the
time-varying vocal tract. Time-frequency filtering consists of multiplying the
time-frequency autocorrelation function $A_x(\tau,\nu)$ (Eq. 2.B.5) of the signal $x(t)$ with a 2-D
window $\Phi(\tau,\nu)$. The 2-D inverse Fourier transform of this windowed function,

$$\mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_x(\tau,\nu)], \qquad (3.3.1)$$

becomes the filtered time-frequency representation. The shape of the window, of
course, determines what energy is kept and what is removed in the filtered
representation [cf. Flandrin 1984].

This technique is in many ways the time-varying generalization of the 'cepstral'
methods presented in Section 3.1. The time-frequency autocorrelation takes the
place of the autocorrelation function, a 2-D window the place of a 1-D window, and
a 2-D inverse Fourier transform of a 1-D Fourier transform in this generalisation.
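The three steps of time-frequency filtering (form the time-frequency autocorrelation, window it, transform back) can be sketched in discrete form. This is a simplified circular implementation for illustration, not the author's: the integer-lag local autocorrelation $x[n+l]\,x^*[n-l]$ doubles the frequency axis, the window widths are given in samples and bins, and the names `tf_filter`, `sigma_lag`, and `sigma_doppler` are hypothetical.

```python
import numpy as np

def tf_filter(x, sigma_lag, sigma_doppler):
    """Gaussian-windowed time-frequency filtering of a length-N circular signal.

    Returns a real array S[n, k]: time index n, frequency bin k (bin k
    corresponds to frequency k/2 because the lag used is 2*l).
    """
    x = np.asarray(x, dtype=complex)
    N = x.size
    n = np.arange(N)[:, None]
    l = np.arange(N)[None, :]
    # Local autocorrelation K[n, l] = x[n + l] * conj(x[n - l]), circular indexing.
    K = x[(n + l) % N] * np.conj(x[(n - l) % N])
    # Time-frequency autocorrelation: Fourier transform over time n.
    A = np.fft.fft(K, axis=0)
    # 2-D gaussian window on signed (doppler, lag) indices.
    k = np.fft.fftfreq(N, d=1.0 / N)
    Phi = np.exp(-0.5 * (k[:, None] / sigma_doppler) ** 2
                 - 0.5 * (k[None, :] / sigma_lag) ** 2)
    # Back to time (inverse transform over doppler), then lag -> frequency.
    K_smooth = np.fft.ifft(Phi * A, axis=0)
    return np.fft.fft(K_smooth, axis=1).real

N, k0 = 128, 10
tone = np.exp(2j * np.pi * k0 * np.arange(N) / N)    # stationary test tone
S = tf_filter(tone, sigma_lag=8.0, sigma_doppler=8.0)
peak_bins = np.argmax(S, axis=1)
print(peak_bins[:5])
```

For the stationary tone the filtered representation is a smooth ridge at bin `2 * k0` at every time index; the lag width sets the frequency smoothing and the doppler width sets the time smoothing.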

The representation in Eq. 3.3.1 also specifies a general member of the quadratic
transforms presented in the previous chapter, indicating that the two chapters are
related. In this chapter, our goal is to show that a member of this class can give a
good estimate of the time-varying 'transfer function' of the vocal tract. Happily, it
turns out that the form of time-frequency window $\Phi(\tau,\nu)$ that gives a good estimate
is a 2-D gaussian, which is the same as Eq. 2.6.7. In other words, we end up with
the same kind of time-frequency representation as in the previous chapter, which
was based there on weaker, but more general goals.

The results of this chapter, then, reinforce and reinterpret those of the previous
chapter. Further, the analysis here suggests which scales to choose, decisions that
were free parameters of Chapter 2. In particular, for voiced speech, $\sigma_t$ is matched
to the pitch period, and $\sigma_\omega$ is matched to the fundamental frequency.
We have just given the basic result of this chapter. It remains to demonstrate its
validity, i.e., that this kind of filtering will give a good estimate of the time-varying
vocal tract 'transfer function'. This requires several steps in which we gradually
generalize the form of the filter that models the vocal tract. In Section 3.4, we
re-examine the stationary case, this time in terms of the time-frequency
autocorrelation function. In Section 3.5, we consider FM filters that have a linearly varying
modulation frequency. In Sections 3.6 and 3.7, we use a locality argument to generalize these
results for quasi-stationary filters and for FM filters that have a smoothly varying
modulation frequency, respectively. In Section 3.8, we use a superposition argument
to treat the multiple pole case.

3.4. The stationary case re-examined

So let us assume for now we want to estimate the transfer function of a filter that is
time-invariant. We will show how the time-frequency autocorrelation function can
be used to produce this estimate.

This will really just be recapitulation of the stationary argument presented in
Section 3.1. In fact, $A_x(\tau,0) = A_x(\tau)$, so we see the correspondence is very close.
But with the time-frequency autocorrelation function we will be in a position to
generalise these results to the time-varying case, so it is worth the effort.

Letting $x(t)$ represent the filter input, $h(t)$ the filter's impulse response, and $y(t)$
the output, we have

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(t,\nu)\, A_h(\tau - t, \nu)\, dt. \qquad (3.4.1)$$

In other words, the time-frequency autocorrelation function $A_y(\tau,\nu)$ consists of the
convolution of $A_x(\tau,\nu)$ and $A_h(\tau,\nu)$ along the $\tau$ dimension. This is analogous to
Eq. 3.1.6.


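Let the filter input be an impulse train, $I(t;T) = \sum_n \delta(t - nT)$. Then

$$A_I(\tau,\nu) = \int_{-\infty}^{\infty} e^{-i\nu t} \sum_n \sum_m \delta(t - nT + \tau/2)\, \delta(t - mT - \tau/2)\, dt.$$

Substituting $t' = t - \frac{1}{2}(m + n)T$ and $\tau' = \tau + (m - n)T$,

$$A_I(\tau,\nu) = \sum_n \sum_m \left\{ \int_{-\infty}^{\infty} e^{-i\nu t'}\, \delta(t' + \tau'/2)\, \delta(t' - \tau'/2)\, dt' \right\} e^{-\frac{i}{2}(n + m)T\nu}.$$

The quantity in braces is the time-frequency autocorrelation function of an impulse,
which is $\delta(\tau')$ [see Claasen & Mecklenbräuker 1980a]. Thus,

$$A_I(\tau,\nu) = \sum_n \sum_m \delta(\tau + (m - n)T)\, e^{-\frac{i}{2}(n + m)T\nu}.$$

Letting $k = n - m$,

$$A_I(\tau,\nu) = \sum_k \delta(\tau - kT)\, e^{-\frac{i}{2}kT\nu} \left\{ \sum_m e^{-imT\nu} \right\}.$$

The quantity in braces is the Fourier transform of an impulse train, which is
itself an impulse train, $\frac{2\pi}{T} \sum_n \delta(\nu - 2\pi n/T)$ [see Bracewell 1978]. Therefore,

$$A_I(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_n (-1)^{kn}\, \delta(\tau - kT)\, \delta(\nu - 2\pi n/T). \qquad (3.4.2)$$

The sign factor $(-1)^{kn}$ is the value of $e^{-\frac{i}{2}kT\nu}$ at $\nu = 2\pi n/T$ and does not affect
the magnitude. Eq. 3.4.2 shows that the time-frequency autocorrelation function of an impulse
train is a rectangular grid of impulses spaced $T$ apart along $\tau$ and $2\pi/T$ apart along
$\nu$ (see Figure 3.3).† Eq. 3.4.2 is the two-dimensional analog of Eq. 3.1.7.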


† Sub-art |11?5^| has derived the time-frequency autocorrelation function for a train of pulses of
arbitrary shape, a result that is important in the theory of radar. The above result follows formally
from this if the pulses are given unit area and approach zero width in the limit.
Figure 3.3. Magnitude of the time-frequency autocorrelation function of an
impulse train (10 msec period).

Figure 3.4. Magnitude of the time-frequency autocorrelation function of the
output of an LTI filter excited by an impulse train. In this simple example the filter
consists of a single pole of 300 Hz bandwidth.
From Eq. 3.4.1, we have

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_n (-1)^{kn}\, A_h(\tau - kT, \nu)\, \delta(\nu - 2\pi n/T), \qquad (3.4.3)$$

the two-dimensional analog of Eq. 3.1.8. $A_y(\tau,\nu)$ consists of a rectangular grid of
shifted $\tau$ slices of $A_h(\tau,\nu)$ (see Figure 3.4).

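Provided the terms in Eq. 3.4.3 do not overlap, $A_h(\tau,0)$ can be recovered from
$A_y(\tau,\nu)$ by windowing it with a rectangular window that is centered on the
origin and that has length $T$, width $2\pi/T$, and height $T/2\pi$ (see Figure 3.5). From
$A_h(\tau,0)$ we can, in turn, recover $|H(i\omega)|^2$, since

$$
\begin{aligned}
\mathcal{F}^{-1}[A_h(\tau,0)\, \delta(\nu)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} A_h(\tau,0)\, \delta(\nu)\, e^{i(\nu t - \omega\tau)}\, d\tau\, d\nu \\
&= \int_{-\infty}^{\infty} A_h(\tau,0)\, e^{-i\omega\tau}\, d\tau \\
&= |H(i\omega)|^2.
\end{aligned}
\qquad (3.4.4)
$$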

On the other hand, if the terms in Eq. 3.4.3 do overlap somewhat, then a low-pass
version of $|H(i\omega)|^2$ can still be recovered, since

$$\mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_y(\tau,\nu)] \approx \mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_h(\tau,0)\, \delta(\nu)] = \phi(t,\omega) ** |H(i\omega)|^2, \qquad (3.4.5)$$

where $\Phi(\tau,\nu)$ is the time-frequency window, and $\phi(t,\omega)$ is its two-dimensional
inverse Fourier transform. In this case, using a rectangular window on the
time-frequency autocorrelation function is a poor choice since its transform rings for a
considerable duration away from the origin. A gaussian window minimises this
problem.
Figure 3.5. Rectangular window (very nearly) recovers 'unaliased' transfer
function. (a) Windowed time-frequency autocorrelation function in Figure 3.4. (b)
Square magnitude of transfer function, the 2-D inverse Fourier transform of '(a)'.
In the 'aliased' case, i.e., if the terms in Figure 3.4 were to overlap significantly, a
gaussian window would be more appropriate.
















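Let us examine the form of $A_h(\tau,\nu)$ assuming for now that the filter consists of only
a single pole, i.e., its impulse response has the form of Eq. 3.1.5. Then

$$
\begin{aligned}
A_h(\tau,\nu) &= \int_{-\infty}^{\infty} e^{-i\nu t}\, h_n(t + \tau/2)\, h_n^*(t - \tau/2)\, dt \\
&= e^{i\omega_n \tau} \int_{-\infty}^{\infty} e^{-(\beta_n + i\nu)t}\, u(t + \tau/2)\, u(t - \tau/2)\, dt \\
&= e^{i\omega_n \tau}\, \frac{e^{-(\beta_n + i\nu)|\tau|/2}}{\beta_n + i\nu}.
\end{aligned}
\qquad (3.4.6)
$$

This last equation is the two-dimensional analog of Eq. 3.1.9.

Thus, provided the pole bandwidth is large enough, windowing $A_y(\tau,\nu)$ can recover
most of $A_h(\tau,0)$ and, hence, a low-pass version of $|H(i\omega)|^2$.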
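The closed form for the single-pole time-frequency autocorrelation can be compared with direct numerical integration of its definition. The sketch below assumes the symmetric definition $A_h(\tau,\nu) = \int e^{-i\nu t}\, h(t + \tau/2)\, h^*(t - \tau/2)\, dt$ and a pole of 300 Hz bandwidth at a hypothetical 1 kHz; the helper names and the test points are illustrative.

```python
import numpy as np

def single_pole(t, beta, w0):
    """h(t) = exp((-beta/2 + i*w0) t) for t >= 0, zero otherwise."""
    t = np.asarray(t, dtype=float)
    decay = np.exp((-beta / 2 + 1j * w0) * np.maximum(t, 0.0))
    return np.where(t >= 0, decay, 0.0)

def ambiguity_numeric(tau, nu, beta, w0, dt=1e-6, n_decay=40.0):
    """Trapezoid-rule evaluation of the defining integral."""
    # The integrand vanishes for t < |tau|/2 and decays as exp(-beta t) afterwards.
    t = abs(tau) / 2 + np.arange(int(n_decay / (beta * dt)) + 1) * dt
    f = (np.exp(-1j * nu * t)
         * single_pole(t + tau / 2, beta, w0)
         * np.conj(single_pole(t - tau / 2, beta, w0)))
    return (f[1:] + f[:-1]).sum() * dt / 2

def ambiguity_closed_form(tau, nu, beta, w0):
    """Closed form of the single-pole result (form of Eq. 3.4.6)."""
    return (np.exp(1j * w0 * tau)
            * np.exp(-(beta + 1j * nu) * abs(tau) / 2) / (beta + 1j * nu))

beta = 2 * np.pi * 300.0
w0 = 2 * np.pi * 1000.0      # hypothetical pole frequency (assumption)
points = [(0.002, 2 * np.pi * 100.0), (-0.003, -2 * np.pi * 50.0), (0.0, 0.0)]
rel_error = max(
    abs(ambiguity_numeric(tau, nu, beta, w0) - ambiguity_closed_form(tau, nu, beta, w0))
    / abs(ambiguity_closed_form(tau, nu, beta, w0))
    for tau, nu in points
)
print(rel_error)
```

Setting $\nu = 0$ in the closed form gives back the one-dimensional autocorrelation of Eq. 3.1.9, which is the correspondence noted in the text.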

3.5. Linearly varying modulation frequency


We now consider the case where we want to estimate the transfer function of an FM
filter that has a linearly varying modulation frequency, i.e., $\gamma(t) = mt$ in Eq. 3.2.2.
This means

$$h(t,a) = h_0(t - a)\, e^{\frac{i}{2}m(t^2 - a^2)}. \qquad (3.5.1)$$

The previous section was the special case $m = 0$.

Let us find how passing a signal through such a filter modifies its time-frequency
autocorrelation function. As usual, we let $x(t)$ represent the input to the filter and
$y(t)$ the output. Thus,

$$
\begin{aligned}
y(t) &= \int_{-\infty}^{\infty} x(a)\, h(t,a)\, da \\
&= e^{\frac{i}{2}mt^2} \int_{-\infty}^{\infty} x(a)\, e^{-\frac{i}{2}ma^2}\, h_0(t - a)\, da.
\end{aligned}
\qquad (3.5.2)
$$
Letting $\tilde{x}(t) = x(t)\, e^{-\frac{i}{2}mt^2}$ and $\tilde{y}(t) = y(t)\, e^{-\frac{i}{2}mt^2}$, we have from Eq. 3.5.2 and
Eq. 3.4.1,

$$A_{\tilde{y}}(\tau,\nu) = \int_{-\infty}^{\infty} A_{\tilde{x}}(t,\nu)\, A_{h_0}(\tau - t, \nu)\, dt. \qquad (3.5.3)$$

In other words, the time-frequency autocorrelation function of $\tilde{y}(t)$ consists of the
convolution of the time-frequency autocorrelation of $\tilde{x}(t)$ and $h_0(t)$ along the $\tau$
dimension.

We are more directly interested in $A_x$ and $A_y$ than $A_{\tilde{x}}$ and $A_{\tilde{y}}$. But this last
transformation is simple, since the time-frequency autocorrelation function has the
following nice property: if $\tilde{x}(t) = x(t)\, e^{-\frac{i}{2}mt^2}$, then [Van Trees 1971]

$$A_{\tilde{x}}(\tau,\nu) = A_x(\tau, \nu + m\tau). \qquad (3.5.4)$$

In other words, multiplying a signal by a linear chirp shears its time-frequency
autocorrelation function along the $\nu$ dimension (see Figure 3.6).
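The shear property has an exact discrete counterpart that is easy to test. The sketch below is an illustration under assumptions: a circular, integer-lag version of the time-frequency autocorrelation (so the lag is $2l$ samples), an even length $N$ so that the chirp is $N$-periodic, and an integer chirp rate `m`. The function name `ambiguity` is hypothetical.

```python
import numpy as np

def ambiguity(x):
    """Discrete circular time-frequency autocorrelation A[p, l] (lag = 2*l)."""
    N = x.size
    n = np.arange(N)[:, None]
    l = np.arange(N)[None, :]
    K = x[(n + l) % N] * np.conj(x[(n - l) % N])   # local autocorrelation
    return np.fft.fft(K, axis=0)                   # transform over time n

N, m = 64, 3                                       # N even, integer chirp rate
rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

n = np.arange(N)
chirp = np.exp(-1j * np.pi * m * n ** 2 / N)       # discrete linear chirp
A_x = ambiguity(x)
A_chirped = ambiguity(x * chirp)

# Shear: each lag column l is shifted along the doppler axis by 2*m*l bins.
p = np.arange(N)[:, None]
l = np.arange(N)[None, :]
A_sheared = A_x[(p + 2 * m * l) % N, l]
print(np.max(np.abs(A_chirped - A_sheared)))
```

The doppler shift grows in proportion to the lag, which is the discrete form of replacing $\nu$ by $\nu + m\tau$.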

Combining Eq. 3.5.3 and Eq. 3.5.4, we see that

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(t, \nu - m\tau + mt)\, A_{h_0}(\tau - t, \nu - m\tau)\, dt. \qquad (3.5.5)$$

In words, the time-frequency autocorrelation function of a signal passed through the
filter in Eq. 3.5.1 can be found by first shearing its input time-frequency
autocorrelation function, convolving that with the time-frequency autocorrelation function
of $h_0(t)$, and then shearing the output time-frequency autocorrelation function in
the opposite direction, all with respect to the $\nu$ dimension (see Figure 3.7).

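When the filter input is an impulse train, the time-frequency autocorrelation
function of the filter output is

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_n (-1)^{kn}\, A_{h_0}(\tau - kT, \nu - m\tau)\, \delta(\nu - m(\tau - kT) - 2\pi n/T). \qquad (3.5.6)$$

In other words, $A_y(\tau,\nu)$ consists of a rectangular grid of shifted $\tau$ slices of $A_{h_0}(\tau,\nu)$
that have been sheared in the $\nu$ direction by slope $m$ (see Figure 3.8).

If these terms do not overlap, then we can window $A_y(\tau,\nu)$ about the origin and
recover the single term $A_{h_0}(\tau,0)\, \delta(\nu - m\tau)$. We can then take its inverse 2-D Fourier
transform to obtain $|H(t,\omega)|^2$:

$$
\begin{aligned}
\mathcal{F}^{-1}[A_{h_0}(\tau,0)\, \delta(\nu - m\tau)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} A_{h_0}(\tau,0)\, \delta(\nu - m\tau)\, e^{i(\nu t - \omega\tau)}\, d\tau\, d\nu \\
&= \int_{-\infty}^{\infty} A_{h_0}(\tau,0)\, e^{-i(\omega - mt)\tau}\, d\tau \\
&= |H_0(i(\omega - mt))|^2,
\end{aligned}
$$

and from Eq. 3.2.3,

$$= |H(t,\omega)|^2 \qquad (3.5.7)$$

(see Figure 3.9).

Figure 3.6. Multiplying a signal by $e^{-\frac{i}{2}mt^2}$ shears its time-frequency
autocorrelation function, $A(\tau,\nu) \rightarrow A(\tau, \nu + m\tau)$.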

On the other hand, if the terms in Eq. 3.5.6 do overlap somewhat, then a low-pass
version of $|H(t,\omega)|^2$ can still be recovered, since

$$
\begin{aligned}
\mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_y(\tau,\nu)] &\approx \mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_{h_0}(\tau,0)\, \delta(\nu - m\tau)] \\
&= \phi(t,\omega) ** |H_0(i(\omega - mt))|^2 \\
&= \phi(t,\omega) ** |H(t,\omega)|^2,
\end{aligned}
\qquad (3.5.8)
$$

where $\Phi(\tau,\nu)$ is the time-frequency window, and $\phi(t,\omega)$ is its inverse Fourier
transform. A 2-D gaussian window is used, and its dimensions are matched to the period
$T$ and the fundamental frequency $2\pi/T$, respectively (see Figure 3.10).

Figure 3.7. Obtaining the time-frequency autocorrelation function, $A_y(\tau,\nu)$, of a
signal passed through the filter in Eq. 3.5.1.

Figure 3.8. Magnitude of time-frequency autocorrelation function of the output
of an FM filter with linearly varying modulation slope (10 Hz/msec) excited by
an impulse train (10 msec period). In this example, the corresponding LTI filter
consists of a single pole of 300 Hz bandwidth.
Figure 3.9. Rectangular window (very nearly) recovers 'unaliased' transfer
function. (a) Windowed time-frequency autocorrelation function in Figure 3.8. (b)
Square magnitude of transfer function, the 2-D inverse Fourier transform of '(a)'. In
the 'aliased' case, i.e., if the terms in Figure 3.8 were to overlap, a gaussian window
would be more appropriate.
So far, we have shown that time-frequency filtering can be used to estimate the
transfer function of two kinds of linear filters: time-invariant filters and FM filters with
linearly varying modulation frequency. We now show that more general cases will
follow from the time locality of this operation.

3.6. The quasi-stationary case

We next consider the quasi-stationary case in which the vocal tract changes slowly
over time. The traditional way to deal with this situation is to extend the stationary
arguments (Section 3.1) by substituting the short-time spectrum for the spectrum
of the entire signal. There are thus two windows involved in this analysis: the
spectrogram window and the autocorrelation function window.

The 'two-dimensional' approach that we have outlined above extends directly
without the need of an additional window. In fact, the estimate of $|H|^2$ is a positive
representation of the signal energy, so from Section 2.6 we know that the estimate
at time $t_0$ effectively depends only on signal values within a few $\sigma_t$ of $t_0$.†
Provided the quasi-stationary signal does not change much
over this interval, the stationary results of Section 3.4 generalize immediately.

† Provided $\sigma_t \sigma_\omega >$ ...

These two approaches for quasi-stationary signals, the former using a 1-D window
on the signal and a 1-D window on the autocorrelation function, and
the latter using a single 2-D window, $\Phi(\tau,\nu)$, on the time-frequency
autocorrelation function, are related. In fact, the former corresponds to a particular choice of
$\Phi(\tau,\nu)$ fixed by its two 1-D windows. The latter approach
specifies the time and frequency scale of interest independently with each of the
dimensions of the window $\Phi(\tau,\nu)$. This is somewhat cleaner than the former, which
selects the time and frequency scales with its two windows, but
not independently.

3.7. Smoothly varying modulation frequency

Suppose the modulation frequency i(t) in £q, 3.2,2 varies smoothly as a function 
of time. In other words, it is approximately linear locally, with V( t) am]], For 
example, a formant with a trajectory that does not have sharp bends in it can be 
modelled this way. By comparison, quasi-etationarity requires the trajectory have 
shallow elope, Le,, is smalt. 

The locality argument used in the preceding section to show that the estimate of
$|H(t,\omega)|^2$ extends to the quasi-stationary case applies equally to the case here. If
the modulation slope, $\gamma'(t)$, does not change much over an interval of a few
$\sigma_t$, then the results of Section 3.5 on filters with a linearly varying modulation
frequency generalise immediately to the smoothly varying case. This is because
the estimate of $|H(t,\omega)|^2$ depends only locally on the signal.

3.8. The vocal tract transfer function

Thus far, we have defined the notion of a frequency modulated filter and its
time-varying transfer function, and we have shown how to estimate this transfer
function from the output signal, provided the modulation slope varies sufficiently
slowly. We did this because we modelled each formant pole as an FM filter. The vocal
tract is modelled as a weighted sum of formant poles, i.e., its impulse response is

    $h(t,\tau) = \sum_{n=1}^{N} \left[ c_n h_n(t,\tau) + c_n^* h_n^*(t,\tau) \right]$,    (3.8.1)

where $h_n(t,\tau)$ is the impulse response of each pole, Eq. 3.2.1 (cf. Eq. 3.1.4).

How can we define the transfer function of such a filter? Extending the stationary
case (Eq. 3.1.1) would suggest

    $H(t,\omega) = \sum_{n=1}^{N} \left[ c_n H_n(t,\omega) + c_n^* H_n^*(t,-\omega) \right]$.    (3.8.2)

There are two advantages of this definition. First, it is a simple generalization of the
stationary case; it allows us to think of the transfer function of the time-varying vocal
tract at a given time t as equivalent to the transfer function of a stationary vocal
tract for the current articulatory configuration. Second, we shall show that it can
be estimated from the speech signal, by the methods we have already presented,
in fact. These two conditions, which we can call abstractly phonetic relevance and
computability, are probably the most important for any representation to satisfy
in the analysis of speech. Unfortunately, there is no simple relation between the
system's eigenvalues or the time-frequency representations of the input and output
signals and this definition of time-varying 'transfer function'. These latter notions
just do not generalise well to this time-varying case.

Two facts show that the transfer function in Eq. 3.8.2 can be estimated by the
time-frequency filtering technique we have described above. The first specifies the
effect of variable gain at the filter output on the transfer function estimate, which
is given by Eq. 3.9.4 in the next section. The second specifies the effect of adding
the output of two filters together on the transfer function estimate. Suppose that
$h(t,\tau) = h_1(t,\tau) + h_2(t,\tau)$ and that $|H_1(t,\omega)|\,|H_2(t,\omega)| = 0$. Then
$|H(t,\omega)|^2 = |H_1(t,\omega)|^2 + |H_2(t,\omega)|^2$. In other words, superposition holds
provided the transfer functions do not overlap. This last condition means that we
must consider only regions where the formants are not too close to each other, as
we did in the stationary argument in Section 3.1 (cf. Eq. 3.1.11).† This relation
holds not only for the transfer functions involved, but also for the estimates of the
transfer functions given by the time-frequency filtering, since they are positive
representations of the signal.

† Of course, formants often come together, but we ignore such time-frequency
regions for simplicity in this argument. A more thorough treatment would try to
deal with these regions also.

Using these two facts, we have

    $\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_y(\tau,\nu)] \approx |H(t,\omega)|^2$    (3.8.4)

for the filter in Eq. 3.8.1, as desired.
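
The superposition step used in this argument can be written out in one line. The sketch below assumes only that the transfer function is additive in the impulse response, so that $H = H_1 + H_2$ when $h = h_1 + h_2$; the arguments $(t,\omega)$ are suppressed.

```latex
\begin{align*}
|H|^2 &= |H_1 + H_2|^2
       = |H_1|^2 + |H_2|^2 + 2\,\mathrm{Re}\{H_1 H_2^*\}, \\
\bigl|\,2\,\mathrm{Re}\{H_1 H_2^*\}\bigr| &\le 2\,|H_1|\,|H_2| = 0
\quad\Longrightarrow\quad
|H|^2 = |H_1|^2 + |H_2|^2 .
\end{align*}
```

The cross term is what fails to vanish where two formants approach each other, which is why such regions are excluded from the argument.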

3.9. The transmission channel

It is convenient at this point to consider the effect of the transmission channel
characteristics on the estimate of the transfer function $|H(t,\omega)|^2$. The results will
prove useful in the next section. We examine two cases: the transmission channel
as an LTI system with impulse response $r(t)$, and the transmission channel having
variable gain $s(t)$.

There are two facts about the Wigner distribution that we need [Claasen &
Mecklenbräuker 1980a]. If $p(t) = r(t) * y(t)$, then

    $W_p(t,\omega) = \int_{-\infty}^{\infty} W_r(\tau,\omega)\, W_y(t-\tau,\omega)\, d\tau$,    (3.9.1)

and if $q(t) = s(t)\,y(t)$, then

    $W_q(t,\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} W_s(t,\alpha)\, W_y(t,\omega-\alpha)\, d\alpha$.    (3.9.2)

In other words, in the first case the Wigner distributions are convolved in time, and
in the second case they are convolved in frequency.

If the spectral shaping of the first transmission channel is gradual, i.e., $r(t)$ is of
short duration, then from Eq. 3.9.1, $W_p(t,\omega) \approx |R(\omega)|^2\, W_y(t,\omega)$, where $R(\omega)$
is the Fourier transform of $r(t)$. If the gain variations of the second transmission
channel are slow, then from Eq. 3.9.2, $W_q(t,\omega) \approx |s(t)|^2\, W_y(t,\omega)$. It follows from
these equations and Eq. 3.5.8 that

    $\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_p(\tau,\nu)] \approx |R(\omega)|^2\, |H(t,\omega)|^2$,    (3.9.3)

and

    $\mathcal{F}^{-1}[\phi(\tau,\nu)\,A_q(\tau,\nu)] \approx |s(t)|^2\, |H(t,\omega)|^2$.    (3.9.4)

Thus, these simple kinds of transmission channels have simple effects on the transfer
function estimate. The broadband LTI channel essentially shapes the estimate's
frequency slices and the slowly varying gain channel shapes its time slices.

3.10. The excitation 

Up to now, we have assumed the filter excitation has been an impulse train. We
consider more general (and realistic) forms of excitation in this section.

We can create a general periodic excitation from an impulse train by passing it
through an LTI filter whose impulse response $r(t)$ has the excitation's pulse shape.
The output can then be passed through the time-varying filter $h(t,\tau)$. Provided
the spectral shaping by $r(t)$ is gradual, i.e., $r(t)$ is of short duration, then these two
filtering operations will commute. The assumption is that the time-varying filter
can be considered quasi-stationary over the duration of $r(t)$. This is a reasonable
assumption for the gradual spectral rolloffs produced in speech excitation. Since
these two operations commute under these circumstances, the effect of the filter
$r(t)$ on the transfer function estimate is given by Eq. 3.9.3.

Similarly, slowly varying changes in the amplitude $s(t)$ of the excitation will result
in corresponding changes in the amplitude of the filter output, with the effect on
the transfer function estimate given by Eq. 3.9.4. The pitch period need not be
constant, either. Using the locality arguments again, we only require that the pitch
period changes slowly.

Finally, consider the case where the filter is noise-excited. Martin & Flandrin
discuss using time-frequency filtering as a general approach for analyzing
non-stationary random signals. Our model here involves not only non-stationarity,
but also noise that is not additive, and a careful theoretical analysis of this case has
not been attempted yet. We must be content, for now, with the following comment.
We have seen in the previous chapter that these methods can be used to select time
and frequency scales that remove the fine structure introduced by the excitation.
This, of course, remains true for this case.




Chapter 4. 

The Schematic Spectrogram 


4.1. Rationale


In the previous chapters we have seen how to obtain a well-behaved representation of
the speech energy, with a choice of the time and frequency scales of interest. For
the next step we are faced with a methodological decision. If we are willing to make
strong assumptions about the signal early on, then we can use those constraints
in some detection scheme. For example, one can assume the speech spectrum is
composed of a number of poles, and use analysis-by-synthesis or linear predictive
coding methods to fit these poles to the spectrum in a formant analysis.


In this approach, a synthetic multiple pole spectrum is fit to each short-time
spectrum. Typically, the pole frequencies can be varied, but for tractability the
number of poles and their bandwidths are held fixed. Stevens & House [1955] and
Olive [1971], for example, computed the mean-square difference between log-magnitude
short-time speech spectra and a function of the form:


    $\log \left| \prod_{n=1}^{N} \frac{s_n s_n^*}{(s - s_n)(s - s_n^*)} \right| + k, \qquad s_n = \sigma_n + j\omega_n$.    (4.1.1)


The poles of the synthetic spectrum that is found to have the least RMS error
are taken to be the formants. The permissible range for each of the poles is often
restricted to the typical ranges for the corresponding formants in this method.
Different versions of this method are identified by the search strategy used to find
the best match. Some have used exhaustive search [Stevens & House 1955; Bell
et al. 1961; Matthews et al. 1961], so-called analysis-by-synthesis. Olive [1971] used
hill-climbing techniques. Linear-predictive coding can be viewed as fitting a fixed
number of poles to short-time spectra, using a slightly different spectral distance
measure than RMS distance [Atal 1971; Markel & Gray 1976]. The great advantage
of LPC is that it provides a simple closed-form solution to the search for an optimum
fit.

One problem with this approach, as stated, is that it depends on the quasi-stationary
assumption. The short-time spectral contribution of a formant in rapid motion is
poorly modelled as a pole with a bandwidth appropriate for a stationary formant.
Even when the bandwidths are variable, as in the LPC technique, the diffuse
spectral contribution of the moving formant can cause incorrect formant matches. In
principle, these methods can be generalised to the time-varying case. Liporace
[1975], in fact, has done so for the LPC technique.

This approach, however, suffers from a more general problem. The model used to
generate the synthetic spectra has little notion of the source or transmission channel
characteristic, or of nasalization. These effects can contribute significantly to the
speech spectrum, "competing" for poles that were meant to be fit to the formants,
and thus often resulting in pole distributions that have poor correspondence to the
formant distribution. The degree of the fit to a particular point in the spectrum
depends on the entire pole distribution, i.e., on the number of poles used and where
each pole is positioned in the spectrum. Thus, errors in one part of the spectrum
are propagated to other parts in the very first stage in the analysis.
For example, Figure 4.1 shows pole locations found by LPC analysis using the
autocorrelation method. The order of the analysis was chosen, as is customary,
to allow for two complex poles per 1000 Hz plus 4 poles for matching the overall
spectral balance (e.g., 12 pole analysis for 4 kHz filtered speech). A Hamming
window was used of 25 msec duration, also a typical choice. In Figure 4.1a, we see
that this analysis can perform poorly in regions of rapid formant motion. In Figure
4.1b,c, it appears that the addition of a nasal resonance in the neighborhood of
F1 resulted in spurious, unstable behavior in the neighborhood of F3. Decreasing
the duration of the window sometimes gives better performance in non-stationary
situations, but increases the overall instability of the solution.
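
For readers who want to see what the "closed-form solution" of the autocorrelation method amounts to, the following is a minimal sketch, not the analysis code used for Figure 4.1. It assumes the standard Levinson-Durbin recursion; the function names (`lpc_order`, `hamming`, `autocorrelation`, `levinson_durbin`, `lpc_frame`) are illustrative choices, and extracting pole locations from the predictor coefficients (a polynomial root-finding step) is left out.

```python
import math

def lpc_order(bandwidth_hz):
    """Rule of thumb quoted in the text: two poles per 1000 Hz, plus 4."""
    return 2 * int(round(bandwidth_hz / 1000.0)) + 4

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err): x[n] is predicted as sum(a[k-1] * x[n-k], k = 1..order),
    and err is the residual prediction-error energy.
    """
    a = [0.0] * (order + 1)           # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                # degenerate (e.g. all-zero) frame: stop early
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        refl = acc / err              # reflection coefficient of stage i
        updated = a[:]
        updated[i] = refl
        for k in range(1, i):
            updated[k] = a[k] - refl * a[i - k]
        a = updated
        err *= 1.0 - refl * refl
    return a[1:], err

def lpc_frame(samples, bandwidth_hz):
    """Window one frame and fit an all-pole model of the rule-of-thumb order."""
    order = lpc_order(bandwidth_hz)
    window = hamming(len(samples))
    frame = [s * w for s, w in zip(samples, window)]
    return levinson_durbin(autocorrelation(frame, order), order)
```

With speech sampled at 8 kHz (4 kHz bandwidth), a 25 msec frame is 200 samples and `lpc_order(4000)` gives the 12-pole analysis mentioned above. Because every coefficient depends on the whole autocorrelation sequence, a disturbance in one part of the spectrum shifts all of the fitted poles, which is the error propagation the text describes.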

The problem, in general, with making such strong assumptions early on in the
analysis is that they are seldom universally true. The excitation, the nasal tract,
and the transmission channel (e.g., room acoustics and noise) all conspire to make
formant analysis more difficult than just fitting poles to a spectrum.

The approach we take here is more conservative, influenced by a similar methodology
applied to vision by Marr. He suggested (1) the principle of least commitment:
make no decisions that may have to be taken back later in the analysis, and (2)
the principle of explicit naming: produce as rich and useful a symbolic description
of the input signal as possible, but without any early commitment to its physical
origin. This description can then be further organized and analyzed with the goal
of finding its physical correlates.

Applying these guidelines to speech suggests taking the energy representations as
in Figure 2.15, and producing rich, symbolic descriptions of the significant features
there. There are several features (at various scales) that suggest themselves: time
discontinuities (up and down edges) useful for finding onsets, offsets and bursts;
time-frequency ridges, easily seen in Figure 2.15, useful for finding the formants
and perhaps channel resonances; and some form of gross spectral balance
measure, also useful for formant and channel analysis. We call this composite symbolic
representation the schematic spectrogram.

Figure 4.1. Examples of problems with the 'pole-fitting' approach. (a) Pole locations
for utterance /wioi/ of Section 2.9. Note the poor performance in the regions of
rapid F2 motion. (b) Spectrogram of /e/ in the context /en/. (c) Pole locations
for this nasalized vowel. Note the spurious behavior in the neighborhood of F3.

4.2. Spectral Peaks 

To create this representation, we must come up with computations that identify
these features. This is not as easy as it may seem, since the features clearly visible
in Figure 2.15 may nevertheless require some non-trivial computations to detect
reliably. We focus on how to find the time-frequency ridges, due primarily to the
formants, in the next sections.

An obvious way to try to find these ridges is to identify peaks in vertical slices of the
time-frequency energy surfaces. This approach has been tried by several authors,
with the main difference between the various instances being how the smoothing
was accomplished. Flanagan [1956] used a filter bank whose output was low-pass
filtered, Schafer & Rabiner used cepstral smoothing [Oppenheim 1969; Oppenheim
& Schafer 1975], while McCandless [1974] used LPC-based smoothing [Atal 1971;
Markel & Gray 1976].

To examine this technique, we will use the smoothed time-frequency surfaces of
Chapters 2 and 3. Since these surfaces are smooth, the spectral peaks can be
found by looking for maxima, i.e., (negative) zero-crossings in $\partial F/\partial\omega$. Figure
4.2 shows these points for the time-frequency energy surface in Figure 2.15. While
the horizontal ridge due to F1 is well captured, the steeply rising F2 is very poorly
captured. This may seem surprising at first, but the reason is simple. Eq.
3.5.8 models the situation with F2. The formant pole $P(\omega - mt)$ with
time-frequency slope $m$ is smoothed by the 2-D gaussian $G(t,\omega)$ to give $F(t,\omega)$. This
will produce a time-frequency ridge in $F(t,\omega)$ that has a roughly constant width,
independent of slope $m$, when measured perpendicular to the formant trajectory in
the time-frequency plane. However, the width of the ridge in a vertical slice increases
with increasing slope; evidently in Figure 2.15, F2 was sufficiently broadened that its
spectral peak was completely lost to other effects in the signal, i.e., other formants,
noise, the source and transmission channel characteristic (cf. Figure 3.4).

Figure 4.2. Peaks in spectral cross-sections of the time-frequency energy surface
in Figure 2.15. The energy ridge due to F2 is poorly captured by this peak
computation.

This effect is not an idiosyncrasy of our particular choice of time-frequency energy
representation. It is true, for example, of any representation computed with signal
windows (e.g., any positive representation, by Thm. A), since if the formant moves
enough in frequency over the duration of the window, its spectral representation
will be significantly broadened.
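
The vertical-slice peak picker and the broadening it suffers from can both be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the computation behind Figure 4.2: `slice_peaks` and `vertical_width` are hypothetical names, the slice is a plain list of samples along frequency, and the width formula assumes a straight ridge with a gaussian cross-section in coordinates where time and frequency have already been scaled to comparable units.

```python
import math

def slice_peaks(spectrum):
    """Indices of local maxima in one vertical (fixed-time) slice.

    A peak is a negative-going zero-crossing of the first difference,
    the sampled counterpart of a negative zero-crossing of dF/d(omega).
    """
    diffs = [b - a for a, b in zip(spectrum, spectrum[1:])]
    return [i for i in range(1, len(diffs))
            if diffs[i - 1] > 0 and diffs[i] <= 0]

def vertical_width(perpendicular_width, slope):
    """Width of a straight ridge as seen in a vertical slice.

    For a ridge along f = slope * t with gaussian cross-section of standard
    deviation `perpendicular_width` measured across the ridge, the same
    ridge cut at fixed t has standard deviation
    perpendicular_width * sqrt(1 + slope**2).
    """
    return perpendicular_width * math.sqrt(1.0 + slope * slope)
```

For example, `slice_peaks([0, 1, 3, 2, 1, 2, 5, 4])` picks out samples 2 and 6. The second function makes the failure mode concrete: a ridge with unit slope already appears about 1.4 times wider in a vertical slice than a horizontal ridge of the same perpendicular width, so its peak is correspondingly flatter and easier to lose.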

One could rethink the design choices for the time-frequency energy representation,
trying for better spectral resolution at the expense of our chosen criteria.
However, the problem is not there, as a re-examination of Figure 2.15 will show. The
F2 ridge is clearly visible in this representation; it looks no more broadened than
the stationary F1. This is because we see both dimensions of time and frequency
simultaneously, and as the formant ridge broadens in frequency with increasing
slope it narrows in time. Its prominence depends on its width perpendicular to its
trajectory, which does not change much with slope.

Why then did we confine our peak detection methods to vertical slices? It was the
usual quasi-stationary prejudice of thinking of speech analysis in terms of a family
of one-dimensional spectral analyses parameterized by time. Just like the energy
representation problem, this problem is inherently two-dimensional and should be
treated as such.

4.3. Time-frequency ridges - non-directional kernel

The approach we will use for detecting time-frequency ridges will depend on whether
we use a directional or a non-directional kernel for the underlying energy
representation. If we use a non-directional kernel, the problem is simpler, so we shall
address this first. In this case, we begin with a single time-frequency representation
at a given time and frequency scale, as in Figure 2.15, and the problem reduces to
finding the ridges in this smooth, two-dimensional surface.

How can we find ridges in a smooth, two-dimensional surface? This becomes a
problem in differential geometry. As such, let us look at the gradient and curvature
vectors of the surface in the neighborhood of a ridge. Figure 4.3 shows them for the
time-frequency surface in Figure 2.15 in the neighborhood of the initial steep F2.
In particular, the solid vectors are used to depict the direction of the gradient, $\nabla F$,
i.e., the local direction of steepest ascent. The dotted vectors depict the direction
of greatest downward curvature, gdc F, i.e., the local direction in which the surface
curves the most downward from the tangent plane.

A precise definition of gdc F is in order. We will use the second derivative as the
measure of curvature; this is sometimes called unnormalized curvature. This is
used instead of normalized curvature (which has the form $f''/[1 + (f')^2]^{3/2}$ in one
dimension) for two reasons. First, it is simpler. Second, unnormalized curvature
scales linearly with a change in the amplitude scaling; normalized curvature does
not. If we use the former, our ridge computation proves invariant under changes in
the amplitude scaling.


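Given this, we define gdc F as the direction vector of the minimum second
directional derivative at a given point. More formally, let

    $H = \begin{pmatrix} \partial^2 F/\partial t^2 & \partial^2 F/\partial t\,\partial f \\ \partial^2 F/\partial f\,\partial t & \partial^2 F/\partial f^2 \end{pmatrix}$    (4.3.1)

denote the Hessian matrix for $F(t,f)$. Let $v$ denote the eigenvector of $H$
corresponding to the lesser eigenvalue $k$. Then gdc F is the direction of $v$.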


Let us now return to Figure 4.3. As one might expect, the gradient points toward the
top of the ridge on each side of it, but must swing through it as one passes over the
top. The direction of greatest downward curvature, however, points perpendicular
to the ridge in its entire neighborhood, since a surface will curve downward more
sharply as one moves toward and away from the top of a ridge than if one moves
along it. Note that the two kinds of vectors will become perpendicular precisely on
the top of the ridge.

Figure 4.3. Gradient and curvature vectors in the vicinity of the rising F2 in Figure
2.15. The solid vectors depict the gradient direction, and the dotted vectors depict
the direction of greatest downward curvature. (The vector lengths are normalized
to unity.)

We define the ridge top as the locus of points that satisfy

    $\nabla F \cdot \mathrm{gdc}\,F = 0$ and $k < 0$,    (4.3.2)

where $k$ is the minimum second directional derivative. The inner product of these
vectors is zero precisely when they are perpendicular, and $k < 0$ ensures that the
point is a ridge top and not a trough bottom.

We now show this definition is equivalent to moving along lines of curvature on
$F(t,f)$ corresponding to the greatest downward curvature and noting passage through
a peak on that surface. This gives an intuitively simple interpretation of a ridge
top, and shows that gdc F essentially provides the local ridge direction.

Let $g : \mathbb{R} \to \mathbb{R}^2$ be a parameterized, differentiable curve with $g'(s) = \mathrm{gdc}\,F(g(s))$. In
other words, $g$ traces out a curve in the time-frequency plane that is always tangent
to the direction of maximum downward curvature. When $F \circ g$ goes through a
peak, $(F \circ g)'(s) = 0$. By the chain rule, this occurs precisely where $\nabla F \cdot g'(s) =
\nabla F \cdot \mathrm{gdc}\,F = 0$. If $k < 0$, the curve goes through a maximum.† But this is just
our ridge top definition, Eq. 4.3.2, as desired.

The inner product in Eq. 4.3.2 is easy to compute for each point on these
time-frequency surfaces (one only needs the first and second derivatives of the
surface, which are simple to compute for such a smooth surface). Since this
quantity may vanish in between sample points in a digital implementation, we detect
zero-crossings between adjacent sample points.
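
A minimal sketch of this ridge-top computation on a sampled surface is given below. It is an illustration under stated assumptions rather than the implementation used for the figures: the surface is a list of lists `F[i][j]` (time index `i`, frequency index `j`) already expressed in suitably scaled coordinates with unit grid spacing, derivatives are central differences, and the names `local_terms`, `ridge_tops` and the `floor` threshold (standing in for the removal of points below the noise level) are hypothetical. One practical detail is made explicit: an eigenvector is defined only up to sign, so the direction at a neighbouring sample is sign-aligned before looking for a zero-crossing.

```python
import math

def local_terms(F, i, j):
    """Gradient, unit gdc vector and lesser Hessian eigenvalue at sample (i, j)."""
    Ft = (F[i + 1][j] - F[i - 1][j]) / 2.0
    Ff = (F[i][j + 1] - F[i][j - 1]) / 2.0
    Ftt = F[i + 1][j] - 2.0 * F[i][j] + F[i - 1][j]
    Fff = F[i][j + 1] - 2.0 * F[i][j] + F[i][j - 1]
    Ftf = (F[i + 1][j + 1] - F[i + 1][j - 1]
           - F[i - 1][j + 1] + F[i - 1][j - 1]) / 4.0
    # Lesser eigenvalue of the symmetric 2x2 Hessian [[Ftt, Ftf], [Ftf, Fff]].
    k = (Ftt + Fff) / 2.0 - math.hypot((Ftt - Fff) / 2.0, Ftf)
    # Two equivalent expressions for its eigenvector; keep the larger one.
    v1 = (Ftf, k - Ftt)
    v2 = (k - Fff, Ftf)
    v = v1 if abs(v1[0]) + abs(v1[1]) >= abs(v2[0]) + abs(v2[1]) else v2
    norm = math.hypot(v[0], v[1])
    if norm == 0.0:                     # isotropic curvature: no preferred direction
        return None
    return (Ft, Ff), (v[0] / norm, v[1] / norm), k

def ridge_tops(F, floor=0.0):
    """Samples adjacent to a zero-crossing of grad F . gdc F where k < 0."""
    candidates = {}
    for i in range(1, len(F) - 1):
        for j in range(1, len(F[0]) - 1):
            if F[i][j] < floor:         # discard low-amplitude clutter
                continue
            terms = local_terms(F, i, j)
            if terms is not None and terms[2] < 0.0:
                candidates[(i, j)] = terms
    tops = set()
    for (i, j), (g, v, _) in candidates.items():
        for nb in ((i + 1, j), (i, j + 1)):
            if nb not in candidates:
                continue
            g2, v2, _ = candidates[nb]
            # Align the neighbour's direction with ours before comparing signs.
            sign = 1.0 if v[0] * v2[0] + v[1] * v2[1] >= 0.0 else -1.0
            q1 = g[0] * v[0] + g[1] * v[1]
            q2 = sign * (g2[0] * v2[0] + g2[1] * v2[1])
            if q1 * q2 <= 0.0:          # inner product changes sign between samples
                tops.add((i, j) if abs(q1) <= abs(q2) else nb)
    return tops
```

On a synthetic diagonal ridge such as `F[i][j] = exp(-(j - i - 0.3)**2 / 8)`, calling `ridge_tops(F, floor=0.1)` marks samples lying along the diagonal, even though every vertical slice of that ridge is broadened by its slope.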

Figure 4.4 shows the zero crossings in this quantity for the time-frequency energy
surface in Figure 2.15. Note that the steep formant peaks are now as well traced
as the stationary ones by this ridge top computation. The only thresholding
performed here is the removal of points below the signal-to-noise ratio of the analysis.
Thus, fairly low amplitude structure can appear in addition to the significant
time-frequency ridges. We will examine in Section 4.6 how to deal with such clutter.

† This assumes $\nabla F \cdot g''(s)$ is negligible; $(F \circ g)''(s) = g'(s)^T H\, g'(s) + \nabla F \cdot g''(s)$,
where $k$ equals the first term.

Figure 4.4. Two-dimensional ridge computation applied to the energy surface in
Figure 2.15. The contours are those points where the gradient direction and
direction of greatest downward curvature are perpendicular. This computation
captures the steep time-frequency ridges, due to rapid formant motion, as well as
the more horizontal ones.

A few pertinent details have not yet been mentioned. First, to perform this
computation, an aspect ratio has to be chosen between time and frequency, since it
is not invariant under different relative scalings of time and frequency. The choice
is natural: we use the scaling inherited from the energy representation, rescaling
time and frequency by the corresponding kernel widths, and perform our
computations in the new coordinates.

Second, very high spatial frequencies have been removed from the energy represen¬ 
tation already. Very low spatial frequencies also appear in the vertical direction, 
due to amplitude variations and formant motion. We find better results when these 
are also removed by filtering; we thus use a smoothed and flattened energy surface 
for the ridge computation. 

4.4. Time-frequency ridges - directional kernel

A second approach to the problem of identifying time-frequency energy ridges uses
directional kernels. Let $F(t,f;\theta)$ be a family of time-frequency representations of
the class defined by the directional kernel of Chapter 2, where $\theta$ gives the preferred
direction of the transform (i.e., the kernel orientation), and the other free
parameters, $\sigma_1$ and $\sigma_2$, are fixed. We would expect in the vicinity of a time-frequency
ridge and for fixed $t$ and $f$, $F(t,f;\theta)$ would be maximum when $\theta$ equalled the local
ridge direction $\theta_0$; in other words, when the transform's orientation is tuned to the
local direction of the energy ridge. We would also expect that $F(t(s),f(s);\theta_0)$ would
be maximum at the ridge top, where $(t(s),f(s))$ is a curve that crosses the ridge
perpendicular to its trajectory. The first case corresponds to a maximum under
rotation of the kernel; the second case corresponds to a maximum under translation
of the kernel along the minor axis of its concentration ellipse (see Figure 4.5).

The locus of points where these two maxima coincide defines a curve in the
time-frequency plane, which we can take as our ridge top definition. That is, we seek
the points that satisfy both

    $\frac{\partial}{\partial\theta} F(t,f;\theta) = 0$    (4.4.1a)

and

    $\nabla F \cdot (\sin\theta, -\cos\theta) = \frac{\partial F}{\partial t}\sin\theta - \frac{\partial F}{\partial f}\cos\theta = 0$.    (4.4.1b)

This computation can be implemented by calculating these two quantities on a
sufficiently fine grid of samples of $(t,f,\theta)$ and then finding the simultaneous zero
crossings in the left-hand sides of Eq. 4.4.1a and Eq. 4.4.1b. (The signs of the
zero crossings have to be examined to ensure that we have maxima and not minima.)

Figure 4.5. Two conditions for ridge detection: (a) local maximum under kernel
rotation, and (b) local maximum under kernel translation along minor axis.

We have yet to specify the scale parameters $\sigma_1$ and $\sigma_2$. Alternatively, we can
specify $\sigma_2$ and $r = \sigma_1/\sigma_2$. We can interpret $\sigma_2$ as the size parameter and $r$ as an
eccentricity parameter, since the greater the value of $r$, the greater the eccentricity
of the concentration ellipse for the kernel (when holding $\sigma_2$ constant).

The choice of $r$ depends on a tradeoff. Clearly, as $r$ increases, time-frequency locality
is sacrificed. In particular, bends in the time-frequency trajectory of an energy ridge
are poorly resolved with larger values of $r$.


On the other hand, larger values of r have an advantage in separating intersecting 
energy ridges, since the larger values of r give better selectivity to a particular 
orientation. We can quantify this selectivity as follows. 

Consider the response of the transform at a frequency $f_0$ to a complex exponential
of frequency $f_0$. The value is independent of $f_0$ and equals the value of $F(0,0;\theta,r)$
when $x(t) = 1$ (i.e., $f_0 = 0$). We can therefore define a tuning curve $\Gamma(\theta,r) =
F(0,0;\theta,r)$ that indicates the selectivity of the transform kernel to different values
of the orientation and eccentricity parameters.

It is straightforward to show that

    $\Gamma(\theta,r) \propto \frac{1}{\sqrt{1 + (r^2 - 1)\sin^2\theta}}$.    (4.4.4)

In Figure 4.6 this tuning curve is plotted as a function of $\theta$ for several values of $r$.


Even greater orientation selectivity can be obtained if we modify this ridge top
computation. The idea is simple; instead of maximizing the energy, $F(t,f;\theta)$, for
various $\theta$ in Eq. 4.4.1a, we can maximize a more directionally selective measure, such
as amount of curvature. In particular, we minimize the second directional
derivative perpendicular to the kernel orientation. But this is equivalent to maximizing
the energy of the transform that uses the modified kernel

    $-\partial^2 G(t,f)/\partial n^2$,

where $n$ is the coordinate along the minor axis of the kernel; in other words, we use
a modified Gaussian kernel in the computation specified by Eqs. 4.4.1a,b. This new
kernel has a central 'excitatory' region with 'inhibitory' flanks that give greater
orientation selectivity. See Figure 4.7.

Figure 4.6. Tuning curves showing directional selectivity of gaussian transform
kernels.

The tuning curve for this modified kernel has the form

    $\tilde{\Gamma}(\theta,r) \propto \cos^2\theta\; \Gamma^3(\theta,r)$.    (4.4.5)

Figure 4.7. Transform kernel given by the negated second derivative of $G(t,f)$
taken perpendicular to the kernel orientation, where $G(t,f)$ is a 2-D gaussian.
This new kernel has a central 'excitatory' region with 'inhibitory' flanks that give
greater orientation selectivity.

In Figure 4.8 this tuning curve is plotted as a function of $\theta$ for several values of $r$.
These indeed show greater selectivity than the corresponding plots in Figure 4.6.
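
The two tuning curves are easy to tabulate. The sketch below assumes the forms $\Gamma(\theta,r) = 1/\sqrt{1 + (r^2-1)\sin^2\theta}$ and $\cos^2\theta\,\Gamma^3(\theta,r)$, each normalized to one at $\theta = 0$; these are a reading of Eqs. 4.4.4 and 4.4.5, and the function names, together with the half-height width used as a selectivity measure, are illustrative choices rather than anything defined in the text.

```python
import math

def tuning_gaussian(theta, r):
    """Assumed tuning curve of the oriented gaussian kernel (peak value 1)."""
    return 1.0 / math.sqrt(1.0 + (r * r - 1.0) * math.sin(theta) ** 2)

def tuning_modified(theta, r):
    """Assumed tuning curve of the second-derivative kernel (peak value 1)."""
    return math.cos(theta) ** 2 * tuning_gaussian(theta, r) ** 3

def half_width(curve, r, step=1e-3):
    """Smallest angle in [0, pi/2] at which the curve falls to half its peak.

    Returns roughly pi/2 when the curve never drops that far.
    """
    theta = 0.0
    while theta < math.pi / 2.0 and curve(theta, r) > 0.5:
        theta += step
    return theta

if __name__ == "__main__":
    for r in (1.0, 2.0, 3.0):
        print(r, half_width(tuning_gaussian, r), half_width(tuning_modified, r))
```

Under these assumptions the plain gaussian kernel with $r = 1$ has no orientation preference at all, while the modified kernel already has a $\cos^2\theta$ tuning at $r = 1$, and both curves narrow as $r$ grows.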

It turns out that this computation is a generalization of the method in Section 4.3.
In particular, if $r = 1$, then the two computations are identical; i.e., those points at
which the maximum downward curvature is perpendicular to the gradient direction
are identical to those points where the minimum second derivative is parallel to a
direction of zero slope.

We therefore see that this section is a generalisation of the previous section. When
$r = 1$, optimal localization in time-frequency results. As $r$ is increased, some of this
locality is sacrificed for improved orientation selectivity. Thus, a non-directional
kernel will give better results when there is only one ridge in the region, while a
directional kernel can give better results when two ridges cross.

Figure 4.8. Tuning curves showing directional selectivity of transform kernels of
the form in Figure 4.7.

Let us examine these results on our example utterance from Section 2.9. For voiced
speech, we choose $\sigma_2$ to match the pitch period, and we let $r > 1$. Then the
pitch will be suppressed in each of the $F(t,\omega;\theta)$, using the results of Chapter 3.
In Figure 4.9, we show the ridge top analysis on our utterance using the kernel of
Figure 4.7 with $r = 2$ and $r = 3$. The case $r = 1$ was shown in Figure 4.4. We see
that a less directional kernel (a smaller value of $r$) gives better performance in the
neighborhood of isolated formants, while a more directional kernel (a larger value of
$r$) gives better performance in regions where two formants 'cross' (see Kuhn [1975]
for a discussion on the 'crossing' of formants in natural speech).

4.5. Signal detection and ridge identification

The preceding sections have been based on heuristic arguments. Can ridge
identification be formulated as a problem in optimal signal detection? We examine this
question in this section. Let us begin by making some particularly simple
assumptions for ease of argument. We assume that the received 2-D signal representation
$F(t,\omega)$ consists of a 2-D deterministic function $S(t,\omega;\gamma(t))$, which depends on the
unknown continuous function $\gamma(t)$, plus additive white 2-D Gaussian noise. The
problem is to estimate $\gamma(t)$, which models the path of an energy concentration in
time-frequency. We further simplify the problem by assuming that $S(t,\omega;\gamma(t))$, which
models the energy ridge, has the form

    $S(t,\omega;\gamma(t)) = G(t,\omega) ** \left[ \sqrt{1 + \gamma'(t)^2}\; \delta(\omega - \gamma(t)) \right]$.    (4.5.1)

In other words, it is a 2-D smoothed (i.e., broadened) curve (the square root factor
normalizes the impulse for a unit step in arc length).

In a straightforward 2-D generalization of the derivation of a matched filter [see
Van Trees 1968], the log likelihood of $\gamma(t)$ is proportional to

    $\Lambda[\gamma(t)] = 2 \iint F(t,\omega)\, S(t,\omega;\gamma(t))\, dt\, d\omega - \iint [S(t,\omega;\gamma(t))]^2\, dt\, d\omega$.    (4.5.2)

Substituting Eq. 4.5.1 into Eq. 4.5.2 and changing the order of integration gives

    $\Lambda[\gamma(t)] = 2 \int \tilde{F}(t,\gamma(t))\, \sqrt{1 + \gamma'(t)^2}\; dt - \iint [S(t,\omega;\gamma(t))]^2\, dt\, d\omega$,    (4.5.3)

where $\tilde{F} = F ** G$. The first term is essentially a 2-D matched filter in which
the convolution $F ** G$ is matched to the signal shape. The second term takes
into account the energy of the deterministic signal. The path $\gamma(t)$ that maximizes
Eq. 4.5.3 is the maximum likelihood estimate.

Figure 4.9. Ridge top analysis of /wioi/ using the directional kernel of Figure 4.7.
(a) r = 2. (b) r = 3. The more directional kernels give better performance where
ridges intersect, but worse performance at sharp bends.
Solving Eq. 4.5.3 for the best path is difficult. In particular, the second term is hard
to evaluate (although it is proportional to the arc length of $\gamma(t)$ when it is sufficiently
smooth). However, an analysis-by-synthesis procedure could, in principle, be used
to compute it numerically. Since we have assumed $\gamma(t)$ is continuous, this becomes
a global optimization over $t$ and $\omega$. This is rather like one-pole analysis-by-synthesis
with a continuity condition imposed on the pole trajectory.
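
To make the analysis-by-synthesis idea concrete, the sketch below scores a candidate path by the first (matched-filter) term of the likelihood only. It is a simplified illustration under stated assumptions: `F_smooth[t][w]` is the already smoothed surface on a unit-spaced grid, a path is a list giving one frequency index per time sample, the derivative of the path is a forward difference, and the energy term is ignored. The name `path_score` is hypothetical.

```python
import math

def path_score(F_smooth, path):
    """Discrete version of 2 * integral of F~(t, gamma(t)) * sqrt(1 + gamma'(t)^2) dt.

    F_smooth: smoothed surface as a list of rows, one row per time sample.
    path:     frequency index of the candidate ridge at each time sample.
    """
    total = 0.0
    for t in range(len(path) - 1):
        slope = path[t + 1] - path[t]        # forward difference, unit time step
        total += F_smooth[t][path[t]] * math.sqrt(1.0 + slope * slope)
    return 2.0 * total
```

Comparing `path_score` over every admissible path is the global search the text describes. Because the omitted energy term grows with arc length, this simplified score is only meaningful when comparing paths of similar length, for example a path along a ridge against the same path displaced in frequency.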

There is a fundamental problem with this approach, similar to the problem with the
pole-fitting approach discussed in Section 4.1. Because of the non-locality of the
optimization, errors at one point can propagate throughout the solution path at this
very first stage of the analysis. If the signal were well modelled by Eq. 4.5.3 and the
noise well modelled by additive, white Gaussian noise, then this would nevertheless
be the best we could do. Realistically, this is not the case. In particular, the "noise"
could include a second ridge, one that we shouldn't treat as noise, but as something
to detect also. The detection scheme, as formulated, is too global. Instead, we need
to make it more local in the time-frequency plane.

Consider a small element Δs of arc length of the curve γ(t), which we can rotate
and translate in the t-ω plane. If we hold its position constant, then for sufficiently
small Δs, Eq. 4.5.3 will be maximized for that element if it is oriented perpendicular
to the direction of greatest downward curvature. If the element's orientation is held
constant, Eq. 4.5.3 will be maximized for that element if one translates it in the
direction of the gradient. Together these imply that elements aligned on the ridge
tops defined by Eq. 4.3.2 will locally maximize Eq. 4.5.3, in the sense that further
maximization requires moving along the ridge. These considerations show that
the ridge operator of Section 4.3 provides a kind of local solution to the detection
problem formulated here.

4.6. Continuity and grouping 

We have seen that the ridge detection methods of the previous sections produce
piecewise continuous contours. This follows formally from the Implicit Function
Theorem; in particular, the zeroes of a continuously differentiable function f: R² →
R must form continuous contours in R². This continuity is a desirable property
of the description since it reflects a constraint on the underlying acoustic events
that is nearly always valid: loosely, that their spectral content varies (piecewise)
continuously as a function of time. For example, formant motion is so constrained.
We explore several ramifications of continuity in this section.

First, continuity helps to solve a practical problem in descriptions of this kind. The
ridge description, as it stands, can be cluttered with low amplitude peaks unrelated
to significant phonetic events. If we try to discard this unwanted structure by setting
a threshold, we would have to keep it fairly low, otherwise we could throw out the
baby with the bath water, breaking important contours into fragments. Continuity
lets us use thresholding with hysteresis, which is often used in such cases [cf. Canny
1983]. The idea is to set two thresholds. Points below the lower threshold are first
discarded. Points that are above the higher threshold are retained, as are any points
between the two thresholds, provided they lie on a contour that crosses the higher
threshold. The result is that insignificant points are discarded without fragmenting
more important contours. The technique can be quite effective; Figure 4.10 shows
an example.
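
The two-threshold rule can be sketched in a few lines. This is an illustrative sketch only, assuming each contour is stored as an ordered list of (time, frequency, amplitude) points; the function name `hysteresis_threshold` and the data layout are hypothetical.

```python
def hysteresis_threshold(contour, low, high):
    """Apply two-threshold (hysteresis) pruning to one ridge contour.

    contour: list of (time, freq, amplitude) points ordered along the ridge.
    Returns the surviving sub-contours as a list of point lists.
    """
    kept, run = [], []
    for point in contour + [None]:  # None is a sentinel that flushes the last run
        if point is not None and point[2] >= low:
            run.append(point)       # point survives the lower threshold
            continue
        # a run of connected points is kept only if it crosses the higher threshold
        if run and any(p[2] > high for p in run):
            kept.append(run)
        run = []
    return kept

amplitudes = [0.2, 0.5, 0.9, 0.6, 0.1, 0.5, 0.4]
contour = [(5 * i, 1000.0, a) for i, a in enumerate(amplitudes)]
survivors = hysteresis_threshold(contour, low=0.3, high=0.8)
print([[p[2] for p in run] for run in survivors])  # [[0.5, 0.9, 0.6]]
```

In the example, the stretch containing the 0.9 peak keeps its weaker neighbours, while the isolated stretch that never exceeds the higher threshold is dropped whole.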
Figure 4.10. Hysteresis thresholding applied to utterance /wioi/ of Section 2.6.
(a) Two-dimensional ridge tops. Amplitude of the ridge top is depicted by the
width of the contour. (b) Hysteresis thresholding of (a). This removes isolated,
low amplitude points without fragmenting the more significant contours.

One may argue that any kind of thresholding is a mistake, since unrecoverable errors
can be made. Instead, one should simply carry along the relative amplitudes and
strengths of the various points in the descriptions, and subsequent processing can
take these weights into account. This is, in principle, safer, but practically it is much
harder to think about processing a cluttered, weighted description than one that
has been first cleaned up. So that the problem does not become too unwieldy at
this stage, it is best for now to proceed with a cleaned up description.

Continuity plays an important role in another problem: labelling. Our goal is
to eventually be able to label the points in the description with their acoustic
correlates, e.g., formant identification. This problem would be greatly simplified
if a whole contour could receive a single label. For example, suppose points along
the two contours in Figure 4.11 are competing for labelling as F2. If the points are
sampled every 5 msec, then the points in a 50 msec stretch can be labelled in 2^10
different ways. If each of the contours, however, is known to have a single acoustic
correlate, then there are only two possible labellings.
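
The arithmetic behind these two counts can be checked directly. The sketch below is only a worked example of the numbers quoted above; the variable names are illustrative.

```python
from itertools import product

stretch_ms, step_ms = 50, 5
n_points = stretch_ms // step_ms  # 10 sample times in the stretch

# Point-wise labelling: at each sample time, either contour "A" or contour "B"
# may be chosen as F2, independently of the choice at every other time.
per_point = 2 ** n_points

# Contour-wise labelling: one choice only, namely which whole contour is F2.
per_contour = 2

# brute-force confirmation of the point-wise count
enumerated = sum(1 for _ in product("AB", repeat=n_points))

print(n_points, per_point, enumerated, per_contour)  # 10 1024 1024 2
```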

This is a simple point, but it is almost universally overlooked. The usual approach
has been to label individual points in a spectrum, and then either ignore continuity
altogether, or use it to narrow the range of candidate labellings after the fact. The
latter approach leads to a combinatorial explosion of possible labellings. Algorithms
such as dynamic programming can be used to make this approach more manageable,
but then the effect of even a single error can be catastrophic. A more direct approach
is to first identify stretches of contour that will receive a unique label, with each
deemed to have a single acoustic correlate.


How can we identify such "atomic" contours? Ideally, our initial analysis would only
return such contours. Acoustic events would never be merged into a single contour,
but would always be resolved as separate. I do not believe such a "perfect" analysis
is possible. It is evidently possible to fool our auditory system on this account.
Consider the spectrum of an /i/ in Figure 4.12a. By low-pass filtering, the spectrum
can be tilted to appear as in Figure 4.12b. This will be perceived as an /u/; the F1
of the /i/ is taken as both F1 and F2. Conversely, an /u/ can be high-pass filtered
to sound like an /i/, with F1+F2 being taken as F1.

Figure 4.11. Two contours competing for labelling as F2. (a) One of 2^10 possible
labellings of a 50 msec stretch when a new label can be assigned every 5 msec. (b)
One of two labellings when whole contours receive a single label.

Listeners seldom make these kinds of mistakes with more natural utterances altered
by this kind of filtering. This is because they hear them in context, with continuity
being an important contextual cue. For example, consider Figure 4.13, which shows
the spectrogram of /wi/. The /i/ in Figure 4.12 was taken from this utterance. If
the entire /wi/ is low-pass filtered in the manner of Figure 4.12, it is perceived as
/wi/, and not as /wu/. Similarly, a high-pass filtered /yu/ will not sound like it
ends in /i/.

Figure 4.12. Turning an /i/ into an /u/. (a) Short-time spectrum of an /i/.
(b) Low-pass filtered /i/. This will be perceived as an /u/; in other words, F1 is
perceived as F1+F2.

There are two points to be learned from these examples. The first is that it is prob-
ably not possible to always separate distinct acoustic correlates of nearby energy
concentrations locally, i.e., they can be merged if heard in isolation. The second
point is that more global constraints, such as continuity, can resolve these mergers.

Figure 4.13. Spectrogram of /wi/. When this utterance is low-pass filtered as
in Figure 4.12, it is still perceived as /wi/. Continuity of the formants allows the
correct perception.


The ridge description will represent sufficiently close formants with a single ridge, as
in Figure 4.14. When the formants merge, one of the contours terminates, and the
other continues on. When the formants split, a new contour appears, while the old
contour continues on. Evidently, some contours can change their label along their
length. For example, the contour in Figure 4.14 that begins as F1+F2 later
splits into F1 and F2. Obviously, we cannot always label whole contours with a
single label.

We can, however, label portions of contours between splits and mergers with a
single label. Said differently, if we identify the locations of splits and mergers,
we can break the contours into a set of "atomic" contours, in the sense that each
contour will receive a single labelling. Since mergers are sparsely distributed in
time-frequency, we will still have a small, manageable set of contours.

Figure 4.14. Merged formants. (a) Wideband spectrogram of utterance "why
am". (b) Ridge tops. When F1 and F2 approximate, their ridges merge.

The idea, then, is to augment our representation to include the locations of splits,
mergers, and crossings of contours. Identifying these junctions will serve two
purposes. First, contour segments away from them can receive single labels along
their length. Second, the junction itself can embody continuity constraints, since the
junctions must be consistently labelled. For example, if two contours enter a
junction and one leaves it, we may label the exiting contour with the union of the
labels of the entering contours.
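
As a small illustration of the merger rule just described, the sketch below forms the label of an exiting contour from the labels of the contours that enter the junction. Representing labels as sets of strings, and the function name `label_exiting_contour`, are assumptions made for the example; no labelling is attempted in this work.

```python
def label_exiting_contour(entering_labels):
    """Union of the labels carried by the contours entering a merger junction."""
    merged = set()
    for labels in entering_labels:
        merged |= set(labels)
    return merged

# two contours labelled F1 and F2 merge into one ridge
print(sorted(label_exiting_contour([{"F1"}, {"F2"}])))  # ['F1', 'F2']
```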

This is somewhat reminiscent of the junction labelling problem in the blocks world.
Perhaps an efficient algorithm to propagate these constraints will be found for
formant labelling, as Waltz [1975] found for the blocks world. The problem here is
greatly complicated by the fact that there can be many kinds of errors, e.g., a
formant can be "missing". Further, other factors such as spectral balance must be
taken into account. We will not attempt any labelling here. Instead, we provide a
description of the signal that is a reasonable step toward that goal.

Provided the ridge description is not too cluttered, which is the rule once low
amplitude contours have been removed, the identification of contour junctions is
relatively easy. In fact, using the proximity of contour endpoints to other contours
is a simple method. Two nearby endpoints define a two point junction. Three
nearby endpoints, or a single endpoint near the body of another contour, define a
three point junction, and so on. Figure 4.15a shows junctions identified by such
proximity rules. Contours that both enter and leave a junction are broken there,
while two point junctions can be bridged provided that simple "good continuation"
rules are satisfied. The result is a set of contours that are likely to have unique
labels of their acoustic correlates along their length. Figure 4.15b shows points
where contours are broken based on these junctions.
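
The proximity rules can be sketched as follows. This is a hedged illustration, not the implementation behind Figure 4.15: it assumes each contour is a densely sampled list of (time, frequency) points with at least two points, that both axes have been scaled to comparable units, and that a single hypothetical `radius` decides what counts as "nearby". The function name `find_junctions` is likewise an assumption.

```python
import math

def find_junctions(contours, radius):
    """Locate junctions from endpoint proximity.

    Returns a list of (order, endpoints, bodies): endpoints are
    (contour_index, point_index) pairs, bodies are indices of contours
    whose interior passes near the junction.
    """
    ends = [(ci, pi) for ci, c in enumerate(contours) for pi in (0, len(c) - 1)]

    def near(a, b):
        return math.dist(contours[a[0]][a[1]], contours[b[0]][b[1]]) <= radius

    junctions, used = [], set()
    for e in ends:
        if e in used:
            continue
        # endpoints of other contours close to this endpoint
        group = [e] + [o for o in ends
                       if o not in used and o[0] != e[0] and near(e, o)]
        # other contours whose interior points pass close to this endpoint
        touching = {ci for ci, c in enumerate(contours) if ci != e[0]
                    for pi in range(1, len(c) - 1) if near(e, (ci, pi))}
        bodies = sorted(touching - {g[0] for g in group})
        order = len(group) + 2 * len(bodies)  # a contour body contributes two branches
        if order >= 2:
            junctions.append((order, group, bodies))
            used.update(group)
    return junctions

contours = [
    [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)],    # ends close to the start of the next contour
    [(1.05, 0.0), (1.5, 0.0), (2.0, 0.0)],
    [(1.5, 0.05), (1.5, 0.5), (1.5, 1.0)],   # one end touches the body of the second contour
]
for order, endpoints, bodies in find_junctions(contours, radius=0.1):
    print(order, endpoints, bodies)
```

In the toy data, the first pair of endpoints forms a two point junction that a good-continuation test could bridge, and the third contour forms a three point junction where the second contour would be broken.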
Figure 4.15. Contour junctions located. (a) Ridge tops of /wioi/ with junctions
identified by simple proximity rules. (b) Dots show points where contours are broken
based on these junctions.

We have shown that the above analysis in some circumstances can produce a more
reasonable schematization of the speech signal than, for example, LPC analysis. We
will give many more examples of this analysis in the next chapter. Does this mean
that the ridge analysis is uniformly better than LPC analysis in speech applications?
The answer is no. The simplicity and speed of the LPC algorithms make them
attractive for many applications. Further, such pole-fitting models do work well
in many cases. Since they embody additional constraints compared to the raw
ridge analysis, they will usually not make the 'mistake' of merging nearby formants
together. Further, insignificant peaks usually do not affect the pole placements.
This means that in clean, unnasalized, quasi-stationary male speech, LPC analysis
can be quite good. In such cases, the ridge analysis may nevertheless merge nearby
formants together and may include additional ridges, making that analysis appear
inferior to the LPC analysis.

This probably means that the ridge analysis will offer no improvement over the
widespread LPC methods in simple speech engineering applications. Frankly, the power
and importance of the ideas presented here comes only when one asks the question:
What methods will be appropriate for speech analysis in general, natural settings?
Under such circumstances, the transmission channel will often be imperfect and
varying (e.g., walking down a hallway with open doors), there can be environmental
sounds and nasalization present, and there can be significant non-stationarity. In
these cases, the very constraints (i.e., all-pole, quasi-stationary model with a fixed
number of poles) that make the LPC technique work so well for 'clean' speech can
cause it to fail in these new circumstances, producing bizarre pole positionings. On
the other hand, the ridge analysis, a more conservative technique that makes no such
assumptions, will still produce a reasonable schematization of the time-frequency
surface. A simple demonstration of these ideas is given in Section 5.6 below. The
key idea is that strong commitments to the origin of the signal are not made at the
level of the schematic spectrogram. It is only after the ridge tops, and undoubtedly
other features such as time-frequency edges, temporal discontinuities, and spectral
balance information, have been made explicit that articulatory constraints and such
will be brought to bear in this more general, least commitment approach.







Chapter 5. 

A Catalog of Examples 


In this chapter we will apply the methods of the previous chapters to a variety of
examples. This will help us evaluate the strong points as well as the shortcomings
of the ideas presented. The ultimate test can come only when these ideas are
applied in a recognition scheme. This, however, has not been realized because of
the many different components that need to be added, as indicated earlier. At this
point, evaluation must be based on any intuitive appeal of the ideas, and on the
performance on various examples. Given that the goal is to essentially 'schematize'
the information seen in (the sonorant regions of) a spectrogram, an obvious test
is to see how reasonable the computed description looks when compared to the
spectrogram. Given that previous approaches perform poorly in specific contexts
(see Figure 4.1), clear improvements will be apparent.

This situation is similar to edge detection in image analysis. The typical way to
evaluate an edge finder is to look at its output compared to the image and ask how
good it looks. Perhaps a better test would be to ask how useful an edge finder
output is, say, when applied to some scheme for finding surface discontinuities, or
stereo depth. But such a test requires confidence in the validity of the subsequent
processing, since a bad application of a good idea can perform more poorly than a

In Section 5.1, we will look at some general example sentences. In the following
sections, we examine several traditional problem categories in speech analysis: in
Section 5.2, we look at semivowels and glides; in Section 5.3, nasalized vowels; in
Section 5.4, consonant-vowel transitions; in Section 5.5, female speech. In Section
5.6 we look at some examples of the effects of different transmission channels on
the analysis.

5.1. Some general examples

The first four figures of this chapter show the sentences "May we all learn a yellow
lion roar," "Are we winning, yet?", "We were away a year ago," and "Why am
I eager?" spoken by adult males. These sentences were chosen because of their
high proportion of sonorant regions and their variety of formant motion. We show
wideband spectrograms and the 'ridge' analysis of the previous chapter for each
of these utterances. First notice the generally good agreement between the
time-frequency ridges seen in the spectrograms and those computed by the ridge
analysis; the latter description is a reasonable partial 'sketch' of the former. This
is true even in the steeper formant regions, such as the various /w/'s and /j/'s in
these examples and at the velar pinch in Figure 5.4 at .75 seconds.

It is important to emphasize that these are not formant tracks, but ridge locations
in the time-frequency surface. For example, when two formants come close enough
to merge, as in the /wi/ in Figure 5.1 (between .2 and .3 seconds and about 2100
Hz) or a portion of the /i/ in Figure 5.4 (between .85 and .9 seconds and 2000 Hz),
only a single ridge is found. (The analysis notes by solid dots the locations that
contours should be broken because of possible mergers (cf. Figure 4.15), which can

There are also ridges present that are not due to the oral formants. For example, the
ridge in Figure 5.4 between .15 sec and .55 sec and at about 200 Hz is attributed to
nasalization from the /m/. Viewed as a formant tracker this is a failure, but viewed
as a ridge detector, this is a success. The nasal resonance is strongly present in the
signal in this region and is correctly identified by the analysis. It is properly left
to subsequent processing to sort out which ridges are due to formants and which
are due to other sources. This is quite different from the LPC analysis, where the
presence of nasalization often causes sporadic and bizarre placement of the pole
locations (Figure 4.1). In that case, subsequent processing would have difficulty
sorting out the situation.

Finally, there are various missing formants. This is particularly true for F3 when F2
is quite low, as in the /w/ in Figure 5.1. In these circumstances, F3 is driven down
by the tail of F2, and is not really visible in the spectrograms either. We know
where F3 is by context, but its time-frequency ridge has essentially been driven into
the noise.

Figure 5.4.

In this section we show examples of /w/'s, /j/'s, /r/'s and /l/'s. The /w/'s and
/j/'s are syllable initial in the context of /wi/ and /ju/ in Figure 5.5 and Figure
5.6, respectively. A range of speech rates from slow to rapid is shown that gives a
range of F2 formant slopes from gradual to steep. Note the ridge analysis is fairly
insensitive to this parameter.

The /l/'s in Figure 5.7 are syllable initial, with one example for each of the cardinal
vowels, /i/, /ae/, /a/, and /u/. The /r/'s in Figure 5.8 are in the context V/r/V,
where V ranges over /i/, /ae/, /a/, and /u/. These too show some rapid formant
motion that is well captured.

5.3. Nasalized vowels

Figure 5.9 shows syllable initial nasalized vowels in the context V/n/. The vowels
range over /i/, /ae/, /a/, and /u/. The main feature of this analysis is that
additional ridges are introduced due to the nasal 'formants'. As mentioned earlier, this
contrasts with the pole-fitting methods, which produce erratic results in nasalized
vowels (Figure 4.1).

Figure 5.8. /r/'s in various vowel contexts. Panels: /iri/, /aerae/, /ara/, /uru/.

In this section we show examples of consonant-vowel transitions. Figure 5.10
through Figure 5.12 show syllable initial consonant-vowel transitions. The
consonants range over the voiced stops /b/, /d/, and /g/ and the vowels range over
/i/, /ae/, /a/, and /u/. The analysis is shown only after the consonantal burst since
the ridge analysis is inappropriate and peculiar in the burst region. The bursts were
located by hand in these examples. Figure 5.13 shows more rapid formant motion
with the examples /bi/ in the context /ubi/ and /du/ in the context /idu/.

The ridge analysis brings out formant motions consistent with the locus theory of
consonant perception. This theory states that one of the cues to the perception
of consonants is the trajectories of the formants at the transitions [Liberman et
al. 1954]. For example, in many vowel contexts for adult males, F2 will have a
trajectory out of the consonant that has a locus near about 1200 Hz for labials (e.g.,
/b/), about 1800 Hz for alveolars (e.g., /d/), and above 2000 Hz for velars (e.g.,
/g/). This cue is used in spectrogram reading, but has been hard to exploit in
automatic speech analysis, because of unreliable formant detection at the often
highly non-stationary consonant-vowel transitions.

The analysis here is better behaved, capturing rapid formant ridges as well as
shallow ones at the transitions. As noted earlier, however, when the formants
approximate, a single ridge is produced. The F3 ridge is also sometimes lost near
the transition for this speaker; in these cases, F3 appears somewhat diffuse and
hard to locate in the spectrograms also. These issues, as well as how to locate the
burst, will present difficulties for automatic consonant detection.

Figure 5.10. Syllable initial /b/'s.

Figure 5.13. Rapid formant transitions. (a) /bi/ in the context /ubi/. (b) /du/
in the context /idu/.

Higher pitched speech, such as female and children's speech, presents the problem
that the harmonics of the (voiced) excitation are fairly widely spaced, viz., a few
hundred Hertz or more. This means that in a quasi-stationary analysis, the spectrum is
less frequently sampled than for lower pitched speech, resulting in poorer estimates
of the vocal tract transfer function (cf. Figure 3.2). Viewed two-dimensionally, the
situation is more symmetric. For example, as the frequency of an impulse train
is increased, the frequency spacing of the impulses in its time-frequency
autocorrelation function (Figure 3.3) will increase, but their time spacing will decrease.
Thus one will have poorer frequency 'sampling' of a time-varying transfer function
excited by this impulse train, but better time 'sampling'.

The analysis presented in Chapter 3 exploits this fact by matching the time-frequency
window to the pitch. Higher pitched speech requires a window at a larger frequency
scale but at a lower time scale than lower pitched speech. The remaining analysis
proceeds as before. Figure 5.14 gives an example with rapid F2 motion. Figure
5.14a shows a wideband spectrogram of the nonsense utterance /uiuiui/ from an
adult female. Figure 5.14b shows the ridge analysis using a time-frequency window
matched to a 200 Hz pitch.
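
The trade-off between frequency and time 'sampling' is easy to tabulate. In the sketch below, the harmonic spacing equals the pitch and the pulse spacing is its reciprocal; the proportional window scaling in `matched_window_scale`, and its 100 Hz reference pitch, are assumptions made for illustration and are not taken from Chapter 3.

```python
def excitation_sampling(pitch_hz):
    """Harmonic spacing (Hz) and pitch pulse spacing (msec) of a voiced source."""
    return {"harmonic_spacing_hz": pitch_hz, "pulse_spacing_ms": 1000.0 / pitch_hz}

def matched_window_scale(pitch_hz, reference_pitch_hz=100.0):
    """Assumed proportional rule: frequency extent grows and time extent
    shrinks by the ratio of the pitch to a reference pitch."""
    ratio = pitch_hz / reference_pitch_hz
    return ratio, 1.0 / ratio  # (frequency scale, time scale)

for pitch in (100.0, 200.0):
    print(pitch, excitation_sampling(pitch), matched_window_scale(pitch))
```

Doubling the pitch from 100 Hz to 200 Hz doubles the harmonic spacing and halves the pulse spacing from 10 msec to 5 msec, which is the symmetry the text describes.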

Note that the F1 ridge and the steep F2 ridge are well resolved. Where F2 and F3
approximate, however, only a single ridge is found. Such mergers in the analysis
are more common in higher pitched speech due to the greater frequency smoothing
required. However, since less time smoothing is required than for lower pitched
speech, transient effects should, in principle, be better resolved.
Finally, we consider the effects of imperfect transmission channels on the analysis.
In particular, we will consider the effects of passing the speech signal through some
simple LTI filters. While the examples we give are idealized, natural environments
can give rise to many kinds of transmission channel characteristics. In general,
human listeners can tolerate a wide variety of alterations to a speech signal and
have it remain intelligible [see Licklider & Miller 1951 for a good review]. That is
not to say one is unaware of the modification; e.g., a pronounced room resonance
adds a 'hollow' quality to the speech, but it does not destroy its intelligibility.

Figure 5.15 shows the frequency response of the transmission channels we consider.
Figure 5.15a consists of a single pole at 1500 Hz of 750 Hz bandwidth, Figure
5.15b consists of a single pole at 1500 Hz of 150 Hz bandwidth, and Figure 5.15c
consists of a pole-zero pair; both are at 1500 Hz, the pole has 1000 Hz bandwidth
while the zero has 150 Hz bandwidth. Thus, the first channel consists of a fairly
broadband, but non-uniform channel; the second channel emphasizes the signal
energy in the neighborhood of 1500 Hz; and the third channel removes signal energy
in the neighborhood of 1500 Hz.
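
For readers who wish to reproduce channels of this kind, the sketch below computes the gain of three digital filters with the stated center frequencies and bandwidths. Several things are assumed rather than taken from the text: a 10 kHz sampling rate, the realization of each "pole" or "zero" as a complex-conjugate pair, and the standard mapping of bandwidth B to root radius exp(-πB/fs). The names `second_order` and `gain_db` are hypothetical.

```python
import cmath
import math

FS = 10000.0  # assumed sampling rate in Hz (not given in the text)

def second_order(center_hz, bandwidth_hz, fs=FS):
    """Coefficients of 1 + a1 z^-1 + a2 z^-2 whose conjugate roots sit at the
    given center frequency with the given bandwidth."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * center_hz / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]

def gain_db(freq_hz, zeros=None, poles=None, fs=FS):
    """Gain in dB at freq_hz of the filter zeros(z) / poles(z)."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)  # z^-1 on the unit circle

    def poly(c):
        return c[0] + c[1] * z1 + c[2] * z1 * z1

    num = poly(zeros) if zeros else 1.0
    den = poly(poles) if poles else 1.0
    return 20.0 * math.log10(abs(num / den))

broad = dict(poles=second_order(1500, 750))    # channel (a)
narrow = dict(poles=second_order(1500, 150))   # channel (b)
notch = dict(zeros=second_order(1500, 150),    # channel (c)
             poles=second_order(1500, 1000))

for f in (500, 1500, 2500):
    print(f, [round(gain_db(f, **ch), 1) for ch in (broad, narrow, notch)])
```

Under these assumptions the narrowband pole gives a much more prominent peak at 1500 Hz than the broadband pole, and the pole-zero pair gives a dip there, matching the qualitative description of the three channels.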

We show the effects of these transmission channels on the analysis of the utterance
/wioi/ from Section 2.6. Figure 5.16a shows the wideband spectrogram of this
utterance passed through the first channel, and Figure 5.16b shows the corresponding
ridge analysis. The effect of this broadband channel is minor when compared to
the original analysis in Figure 4.10. Figure 5.17a shows the wideband spectrogram
of the utterance passed through the second channel, and Figure 5.17b shows the
corresponding ridge analysis. The effect of this narrowband channel is to add an
additional ridge at 1500 Hz. Finally, Figure 5.18a shows the wideband spectrogram
of the utterance passed through the third channel, and Figure 5.18b shows the
corresponding ridge analysis. The effect of this narrowband 'notch' is to put an energy
trough in the time-frequency surface, with the F2 ridge being partially cancelled
in the vicinity of this notch. Compare this analysis with the LPC analysis of this
filtered utterance shown in Figure 5.18c (using the same analysis parameters as in
Figure 4.1). We see there that the notch filter plays havoc with the LPC analysis,
since the zero lies outside the scope of its all-pole model. This is analogous to the
effects of nasalization on LPC analysis.

Figure 5.14. /uiuiui/ uttered by an adult female. (a) Wideband spectrogram. (b)
Ridge analysis.

Figure 5.15. Transmission channels. (a) 750 Hz bandwidth pole at 1500 Hz. (b)
150 Hz bandwidth pole at 1500 Hz. (c) Pole-zero pair at 1500 Hz of 1000 Hz and
150 Hz bandwidth, respectively.

Figure 5.16. /wioi/ passed through transmission channel in Figure 5.15a (broadband
filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.17. /wioi/ passed through transmission channel in Figure 5.15b (narrowband
filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.18. /wioi/ passed through transmission channel in Figure 5.15c (notch
filter). (a) Wideband spectrogram. (b) Ridge analysis. (c) LPC analysis.